Sparse Bayesian factor models are routinely implemented for parsimonious
dependence modeling and dimensionality reduction in high-dimensional
applications. We provide theoretical understanding of
such Bayesian procedures in terms of posterior convergence rates in 
inferring high-dimensional covariance matrices where the dimension 
can be potentially larger than the sample size. Under relevant sparsity 
assumptions on the true covariance matrix, we show that commonly- 
used point mass mixture priors on the factor loadings lead to consis- 
tent estimation in the operator norm even when p>n. One of our 
major contributions is to develop a new class of continuous shrink- 
age priors and provide insights into their concentration around sparse 
vectors. Using such priors for the factor loadings, we obtain the same 
rate as obtained with point mass mixture priors. To obtain the con- 
vergence rates, we construct test functions to separate points in the 
space of high-dimensional covariance matrices using insights from 
random matrix theory; the tools developed may be of independent 
interest. 



1. Introduction. It is now routine to collect data where the dimension 
p is much larger than the sample size n, and interest focuses on the covariance 
structure. In this context, even a simple parametric model like the Gaussian 
distribution leads to a high-dimensional model space, since an unstructured 
$p \times p$ covariance matrix has $O(p^2)$ free parameters. It is thus necessary to
reduce the effective number of parameters via imposing sparsity or some 
lower-dimensional structure. Sparse Bayesian factor models (West, 2003) 
provide one popular choice in applications, but currently lack theoretical
support. In this paper, we close this gap by studying asymptotic properties 
for scenarios in which p grows faster than n. 

Factor models (Bartholomew, 1987) aim to explain dependence among 
multivariate observations through shared dependence on a smaller number 
of latent factors. Given $n$ i.i.d. observations $y_i \in \mathbb{R}^p$, the generic form of a
latent factor model is

(1.1)  $y_i = \mu + \Lambda \eta_i + \epsilon_i, \quad \epsilon_i \sim \mathrm{N}_p(0, \Omega), \quad i = 1, \ldots, n,$

where $\mu$ is an intercept term, $\Lambda$ is a $p \times k$ factor loadings matrix with $k \ll p$,
$\eta_i \sim \mathrm{N}_k(0, I_k)$ are standard normal latent factors, and $\epsilon_i$ is a residual having
diagonal covariance $\Omega = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$. We follow standard practice in
centering the data prior to analysis and henceforth shall set $\mu = 0$ in (1.1).
Marginalizing out the latent factors, $y_i \sim \mathrm{N}_p(0, \Sigma)$ with

(1.2)  $\Sigma = \Lambda \Lambda^{\mathrm{T}} + \Omega,$

which has at most $p(k + 1)$ parameters, resulting in a huge reduction in model
complexity.
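
The decomposition (1.2) and the parameter count can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the function name `simulate_factor_model`, the N(0, 1) loadings and the uniform residual variances are illustrative choices and are not taken from the paper.

```python
import numpy as np

def simulate_factor_model(n, p, k, rng):
    """Draw n observations from y_i = Lambda eta_i + eps_i (mu = 0)."""
    Lambda = rng.normal(size=(p, k))             # loadings (illustrative choice)
    sigma2 = rng.uniform(0.5, 1.5, size=p)       # residual variances (illustrative)
    eta = rng.normal(size=(n, k))                # latent factors, N_k(0, I_k)
    eps = rng.normal(size=(n, p)) * np.sqrt(sigma2)
    y = eta @ Lambda.T + eps
    Sigma = Lambda @ Lambda.T + np.diag(sigma2)  # implied covariance, as in (1.2)
    return y, Sigma

rng = np.random.default_rng(0)
n, p, k = 20000, 20, 2
y, Sigma = simulate_factor_model(n, p, k, rng)

S = y.T @ y / n                                  # sample covariance (mean is zero)
rel_err = np.max(np.abs(S - Sigma)) / np.max(np.abs(Sigma))

n_factor_params = p * (k + 1)                    # pk loadings + p residual variances
n_unstructured = p * (p + 1) // 2                # free entries of a symmetric p x p matrix
print(rel_err, n_factor_params, n_unstructured)
```

With $n$ much larger than $p$ the sample covariance is close to $\Lambda\Lambda^{\mathrm{T}} + \Omega$; the regime of interest in this paper is the opposite one, where such agreement cannot be expected without structural assumptions.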

There is a sizeable literature studying asymptotic properties of various 
aspects of factor analysis, including consistent estimation of factor load- 
ings and latent factors (Bai, 2003) and the number of factors (Bai and Ng, 
2002; Lam and Yao, 2012). Fan, Fan and Lv (2008) studied rates of conver- 
gence of high-dimensional covariance estimates based on factor models, with 
Fan, Liao and Mincheva (2011) extending their results to approximate factor
models that allow non-diagonal $\Omega$ in (1.2). This work assumes that the
factor scores $\eta_i$ are known, while we consider the fundamentally different
setting in which the factor scores are unknown while also studying concen- 
tration of a Bayesian posterior instead of convergence of a point estimate. 

A prior distribution on the loadings and the residual variances induces a 
prior distribution on the space of covariance matrices, and we are specifically 
interested in studying concentration of the corresponding posterior measure 
around the "true" covariance matrix. When the parameter space is finite 
dimensional, it is well known that the posterior contracts at the parametric 
rate of $n^{-1/2}$ under mild regularity conditions (Ghosal, Ghosh and van der Vaart,
2000). However, we are interested in the asymptotic framework in which the
dimension $p = p_n$ grows with the sample size $n$, and hence the classical
results do not apply to our case. Although this setting has motivated abundant
frequentist work, relatively little has been done in the Bayesian setting,
with most of the focus being on linear regression and the closely related
normal means problem; relevant references include Armagan, Dunson and Lee
(2011); Belitser and Ghosal (2003); Bontemps (2011); Castillo and van der Vaart
(2012); Ghosal (1997) among others. In fact, to the best of our knowledge, ours
is the first paper to study the asymptotic properties of Bayesian covariance
estimation in this context.

Now we summarize the main results obtained in this paper. We begin with 
the study of a moderately high-dimensional setting where $p_n$ grows slower
than $n$. A Bayesian specification (Arminger and Muthen, 1998; Song and Lee,
2001) of the factor model for such applications commonly uses inverse-gamma
priors on the residual variances and normal priors on the loadings. We show
in Theorem 3.3 below that such priors lead to a posterior contraction rate of
$\sqrt{p_n^{\gamma}/n}$ in the Frobenius norm whenever the true covariance underlying the
data generating mechanism admits a factor decomposition as in (1.2) with
the number of factors $k = k_n = O(1)$. Thus even if $p_n$ is allowed to grow
with $n$, we obtain posterior consistency as long as $p_n^{\gamma}/n \to 0$ for some $\gamma > 1$.

The second set of results pertains to the more interesting case, $p_n \gg n$.
In this regime, we shall consider a weaker notion of discrepancy, namely the
operator norm or equivalently the largest eigenvalue. Although the original
specification of the factor model reduces the number of parameters from
$O(p_n^2)$ to $O(p_n)$, the estimation problem is still challenging when $p_n \gg n$.
To address this challenge, West (2003) introduced sparse factor modeling
to allow many of the loadings to be exactly equal to zero through a point
mass mixture prior; see also Carvalho et al. (2008); Lucas et al. (2006). We
show in Theorem 3.5 that, for appropriate point mass priors, the posterior
distribution contracts at a rate of $O(\sqrt{(\log p_n)^{\gamma}/n})$ in the operator norm if
the true sequence of covariance matrices admits a factor-type decomposition
(1.2) with $k_n = O(1)$ many factors and some realistic sparsity assumption
on the loadings. Thus, we obtain consistency as long as $p_n = O(e^{n^{\alpha}})$ for
some $\alpha \in (0, 1/\gamma)$. This is particularly appealing since the dimensionality
affects the rate only through a logarithmic factor and thus provides a
theoretical validation of the efficacy of sparse factor models for high-dimensional
covariance matrix estimation for $p_n \gg n$.

Our final set of results concerns the development of continuous shrinkage priors
that achieve the same rate of convergence as that of the point mass
mixture priors. Although point mass mixture priors on factor loadings are
intuitively appealing and possess attractive theoretical properties, they lead
to daunting posterior computation, with typical MCMC algorithms for updating
elements of the loadings matrix one at a time facing problems with
slow convergence and mixing. To address such problems through block updating,
while allowing a weaker notion of sparsity in which elements are close
to zero instead of exactly zero, continuous shrinkage priors can be used. Such
priors have become common in regression (Armagan, Dunson and Lee, 2011;
Carvalho, Polson and Scott, 2010; Hans, 2011; Park and Casella, 2008), with
Polson and Scott (2010) providing a unifying local-global scale mixture
representation. The lack of tight concentration bounds for such priors has limited
the study of their asymptotic properties. We develop a novel class of
local-global shrinkage priors for which such bounds can be obtained, leading
to a rate of $\sqrt{(\log p_n)^{\gamma}/n}$ in operator norm in the $p_n \gg n$ setting.

Technically, our methods proceed via the usual route of first establish- 
ing the prior concentration and constructing test functions with appropriate 
rates independently and then combining these two using entropy calcula- 
tions. For constructing test functions for the operator norm, the traditional 
tests based on likelihood ratios do not yield the right rates. Instead we con- 
struct tests inspired by results from the non-asymptotic theory of random 
matrices and hence these calculations are of independent interest and may 
be useful to other problems in high-dimensional estimation. 

High-dimensional covariance matrix estimation has been widely studied 
from a frequentist perspective. The inadequacy of the sample covariance 
in $p_n \gg n$ settings is well known, motivating regularized estimators based
on banding or tapering the sample covariance matrix (Bickel and Levina, 
2008b; Furrer and Bengtsson, 2007; Wu and Pourahmadi, 2010), banding 
the Cholesky factor (Wu and Pourahmadi, 2003), regularizing the inverse 
Cholesky factor (Huang et al., 2006; Levina, Rothman and Zhu, 2008), thresh- 
olding the sample covariance matrix (Bickel and Levina, 2008a; Cai and Liu, 
2011; El Karoui, 2008), regularizing the precision matrix (Rothman et al., 
2008) and regularized principal component analysis (Johnstone and Lu, 2009; 
Zou, Hastie and Tibshirani, 2006) among others. Theoretical properties of 
such regularized estimators have been studied in Bickel and Levina (2008a, b); 
El Karoui (2008); Lam and Fan (2009), with explicit rates of convergence 
obtained in an asymptotic framework where $p_n$ increases with $n$. Minimax
optimal rates in operator and Frobenius norm have also been recently established
in Cai, Zhang and Zhou (2010).

The rest of the paper is organized as follows. After setting up the basic 
notations and definitions in Section 2, we present the main results of this 
paper in Section 3. In Section 4, we discuss and provide guidelines for prior 
elicitation based on the theoretical results. Section 5 develops a number of 
auxiliary results of independent interest that are used to prove the main 
results in Section 6. Proofs of some technical lemmas are given in an
Appendix.
2. Preliminaries. Given sequences $a_n, b_n$, we shall denote $a_n = O(b_n)$
if there exists a global constant $C$ such that $a_n \le C b_n$.

Given a metric space $(X, d)$, let $N(\epsilon; X, d)$ denote its $\epsilon$-covering number,
i.e., the minimum number of balls of radius $\epsilon$ needed to cover $X$.

For a vector $x \in \mathbb{R}^r$, $\|x\|_2$ denotes its Euclidean norm. We will use $S^r$ to
denote the unit Euclidean sphere $\{x \in \mathbb{R}^r : \|x\|_2 = 1\}$ and $\Delta^{r-1}$ to denote
the $(r-1)$-dimensional simplex $\{x = (x_1, \ldots, x_r)^{\mathrm{T}} : x_j \ge 0, \sum_{j=1}^{r} x_j = 1\}$.
Further, let $\Delta_0^{r-1}$ denote $\{x = (x_1, \ldots, x_{r-1})^{\mathrm{T}} : x_j \ge 0, \sum_{j=1}^{r-1} x_j \le 1\}$.

For a square matrix $A$, $\mathrm{tr}(A)$ and $|A|$ respectively denote the trace and
the determinant of $A$. For a $p \times r$ matrix $A = (a_{jj'})$ with $p \ge r$, using the
singular value decomposition we may write

$A = \sum_{k=1}^{r} s_{(k)}\, u_k v_k^{\mathrm{T}},$

where $s_{(1)} \ge s_{(2)} \ge \cdots \ge s_{(r)} \ge 0$ denote the singular values of $A$ (or equivalently
the eigenvalues of $\sqrt{A^{\mathrm{T}} A}$) arranged in decreasing order and $u_k, v_k$
denote the corresponding singular vectors. We shall also use $s_{\min}(A)$ and
$s_{\max}(A)$ to denote the smallest and largest singular values respectively. We
will investigate the posterior convergence rates for two norms: the Frobenius
norm ($\|\cdot\|_F$) and the operator norm ($\|\cdot\|_2$), defined in the usual way:

$\|A\|_F = \sqrt{\sum_{j=1}^{p} \sum_{j'=1}^{r} a_{jj'}^2}, \qquad \|A\|_2 = \sup_{x \in S^r} \|Ax\|_2 = s_{\max}(A).$

Clearly, for any fixed dimension p, the above two norms are equivalent and 
thus convergence rate in one norm will lead to identical convergence in the 
other. However this is no longer the case when the dimension p = p n grows 
with n. In fact we will see below that the convergence rates are indeed 
different for the two norms above. 
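
The growing gap between the two norms is already visible for the identity matrix, for which $\|I_p\|_F = \sqrt{p}$ while $\|I_p\|_2 = 1$. A minimal sketch, assuming NumPy (the helper names are illustrative):

```python
import numpy as np

def frobenius_norm(A):
    """Square root of the sum of squared entries."""
    return float(np.sqrt(np.sum(A ** 2)))

def operator_norm(A):
    """Largest singular value of A."""
    return float(np.linalg.svd(A, compute_uv=False)[0])

# Ratio ||I_p||_F / ||I_p||_2 for increasing dimension p.
ratios = {}
for p in (4, 25, 100):
    I = np.eye(p)
    ratios[p] = frobenius_norm(I) / operator_norm(I)
print(ratios)
```

The ratio equals $\sqrt{p}$, so the constants relating the two norms are not uniform in the dimension.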

For a subset $S \subset \{1, \ldots, p\}$, let $|S|$ denote the cardinality of $S$ and define
$\theta_S = (\theta_j : j \in S)$ for a vector $\theta \in \mathbb{R}^p$. Denote supp($\theta$) to be the support of
$\theta$, i.e., the subset $S_0 \subset \{1, \ldots, p\}$ corresponding to the non-zero entries of
$\theta$. We shall continue to use the same notations for a subset of entries and
support for matrices $A$, where it has to be interpreted that $A$ is vectorized
column-wise. Let $\ell_0[s;p]$ denote the subset of $\mathbb{R}^p$ given by

$\ell_0[s;p] = \{x \in \mathbb{R}^p : \#(1 \le j \le p : x_j \ne 0) \le s\}.$

Clearly, $\ell_0[s;p]$ consists of $s$-sparse vectors $\theta$ with $|\mathrm{supp}(\theta)| \le s$.
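
The support and the class of $s$-sparse vectors translate directly into code. The sketch below assumes NumPy and uses 0-based indices; the names `supp` and `in_l0_ball` are illustrative. Matrices are vectorized column-wise, as described above.

```python
import numpy as np

def supp(theta):
    """Indices (0-based) of the non-zero entries; matrices are vectorized column-wise."""
    flat = np.ravel(np.asarray(theta), order="F")
    return set(np.flatnonzero(flat).tolist())

def in_l0_ball(theta, s):
    """True if theta has at most s non-zero entries."""
    return len(supp(theta)) <= s

theta = np.zeros(10)
theta[[2, 7]] = [1.5, -0.3]

A = np.zeros((3, 2))
A[1, 0] = 1.0   # column-wise position 1
A[0, 1] = 2.0   # column-wise position 3
print(supp(theta), supp(A))
```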

IG($a$, $b$) (resp. IG$_{[c,d]}$($a$, $b$)) denotes the inverse-gamma density with
parameters $a, b$ (resp. truncated to the interval $[c, d]$). Throughout, $C, C'$ are
generically used to denote positive constants which are irrelevant to our
purpose.

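Finally, let $\mathcal{C}_n$ denote the cone of covariance matrices of size $p_n \times p_n$ and
let $\Sigma_{0n} \in \mathcal{C}_n$ denote a true sequence of covariance matrices. We observe

$y_1, \ldots, y_n \overset{iid}{\sim} \mathrm{N}_{p_n}(0, \Sigma_{0n})$

and set $y^{(n)} = (y_1, \ldots, y_n)$. We model the data as

(2.1)  $y_i \overset{iid}{\sim} \mathrm{N}_{p_n}(0, \Sigma_n), \quad \Sigma_n = \Lambda_n \Lambda_n^{\mathrm{T}} + \Omega_n.$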

3. Main results. In this section we present the main results of this
paper. Let $\Theta^{(p,k)} = \mathbb{R}^{p \times k}$ denote the class of real-valued $p \times k$ matrices. We
start with the following assumptions on the true covariance matrix of the
observed data $y^{(n)}$.

Assumption 3.1. The true sequence of covariance matrices $\Sigma_{0n}$ is of
the form

(A0)  $\Sigma_{0n} = \Lambda_{0n} \Lambda_{0n}^{\mathrm{T}} + \Omega_{0n}, \quad \Lambda_{0n} \in \Theta^{(p_n, k_{0n})}, \quad \Omega_{0n} = \sigma_{0n}^2 I_{p_n},$

and $k_{0n} = O(1)$ is known.

Assumption 3.1 says that the true sequence of covariances $\Sigma_{0n}$ admits a
factor decomposition as in (1.2) with $\Omega_{0n} = \sigma_{0n}^2 I_{p_n}$. We assume $\Omega_n = \sigma^2 I_{p_n}$
in (2.1). We also assume the true number of factors $k_{0n}$ to be bounded and
known for notational simplicity, and it is to be understood that the model
(2.1) is fitted with $k_n = k_{0n}$ many factors. However, we have identified the
role of $k_n$ in all calculations and our results can be easily relaxed by placing
a prior on the number of factors $k_n$ and assuming an appropriate growth of
the true number of factors $k_{0n}$.

A prior distribution $\Pi_n(\Lambda_n \otimes \sigma^2)$ on $\Theta^{(p_n,k_n)} \times \mathbb{R}^+$ induces a prior
distribution on $\mathcal{C}_n$, which is also denoted by $\Pi_n(\Sigma_n)$. We will denote by $\Pi_n(\cdot \mid y^{(n)})$
the corresponding posterior distribution for $\Sigma_n$. For a sequence of numbers
$\epsilon_n \to 0$ and a constant $M > 0$ independent of $\epsilon_n$ (to be chosen later), let

(3.1)  $U_n = \{\Sigma_n : \|\Sigma_n - \Sigma_{0n}\| \le M\epsilon_n\}$

denote a ball of radius $M\epsilon_n$ around $\Sigma_{0n}$ with respect to some matrix norm
$\|\cdot\|$; we shall focus on the Frobenius norm and the operator norm
($\|\cdot\|_2$) in the sequel. For various prior distributions on $\Theta^{(p_n,k_n)} \times \mathbb{R}^+$, we
seek to find a minimum possible such sequence $\epsilon_n$ and a subset $\mathcal{C}_{0n} \subset \mathcal{C}_n$
such that for any $\Sigma_{0n} \in \mathcal{C}_{0n}$,

(3.2)  $\lim_{n \to \infty} \Pi_n(U_n^c \mid y^{(n)}) = 0$, in probability,

where $\Pi_n(U_n^c \mid y^{(n)})$ denotes the posterior probability of the event $U_n^c$. Notice
that the posterior measure is random since it is conditioned on the observed
data; thus the above limit is over the probability space corresponding to the
observed data.

3.1. Frobenius norm. We now mention specific assumptions on $\Sigma_{0n}$ and
prior choices.

Assumption 3.2. In addition to Assumption 3.1, the true covariance
matrix $\Sigma_{0n} \in \mathcal{C}_n$ satisfies the following:

(AF1) $p_n < n$ with $\lim_{n \to \infty} p_n^{\gamma}/n = 0$ for some $\gamma > 1$.
(AF2) ^<al n <M a . 

We assume the following prior on $\lambda_{jh} = [\Lambda_n]_{jh}$ and $\sigma^2$,

(P0)  $\lambda_{jh} \sim \mathrm{N}(0, 1), \ j = 1, \ldots, p_n, \ h = 1, \ldots, k_n, \quad \sigma^2 \sim \mathrm{IG}_{[0, M_\sigma]}(a, b).$

We now state our main theorem on posterior convergence rates in Frobenius
norm.

Theorem 3.3. Suppose the true sequence of covariance matrices $\Sigma_{0n}$
satisfies Assumption 3.2 with $\gamma > 9$ in (AF1). Also, assume the prior
distribution $\Pi_n(\Lambda_n \otimes \sigma^2)$ as in (P0). Then with $\epsilon_n = \sqrt{\,\cdots\,(\log n)^3}$ and for some
$M > 0$ large enough,

$\lim_{n \to \infty} E_{\Sigma_{0n}} \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_F > M\epsilon_n \mid y^{(n)}) = 0.$

3.2. Operator norm. Cai, Zhang and Zhou (2010) showed that the minimax
optimal rate in operator norm is given by $\sqrt{\log p_n / n}$ for their sparsity
class. Although their sparsity class is different from the one implied by factor
models, it would be appealing to obtain a similar rate of convergence as the
effect of dimensionality enters only through a logarithmic factor. We now
mention specific assumptions on $\Sigma_{0n}$ and prior choices.

Assumption 3.4. In addition to (A0) in Assumption 3.1, the true covariance
matrix $\Sigma_{0n} \in \mathcal{C}_n$ satisfies the following:

(A1) $\lim_{n \to \infty} (\log p_n)^{\gamma}/n = 0$ for some $\gamma > 1$.
(A2) Each column of $\Lambda_{0n}$ belongs to $\ell_0[s_n; p_n]$ with $s_n = O(\log p_n)$.
(A3) There exists a sequence of positive real numbers $c_n$ with $c_n = O(\log p_n)$
such that



—ALA, 



(A4) There exist constants $\sigma_0^{(l)}$ and $\sigma_0^{(u)}$ such that $\sigma_0^{(l)} \le \sigma_{0n} \le \sigma_0^{(u)}$.

We now discuss implications of each of the above assumptions. 

• Assumption (A1) allows $p_n$ to grow faster than $n$ under the very mild
assumption of $(\log p_n)^{\gamma}/n \to 0$. In particular, $p_n$ can be of the order
of $\exp(n^{\alpha})$ for any $\alpha \in (0, 1/\gamma)$.

• Following the motivation in West (2003), one requires sparsity in the
loadings for meaningful inference in $p_n \gg n$ situations. This is reflected
through (A2), requiring the columns of the loadings matrix to be sparse with
$O(\log p_n)$ many signals per column. Notice that this is where we differ
from sparsity assumptions used by previous authors (Bickel and Levina,
2008a; Cai and Liu, 2011; Cai, Zhang and Zhou, 2010; Levina, Rothman and Zhu,
2008). Even if many entries of $\Lambda$ are exactly zero, the corresponding
covariance matrix need not have many zero entries.

• Conditions similar to (A3) have been used previously in the economet- 
ric factor model setting (Fan, Fan and Lv, 2008; Fan, Liao and Mincheva, 
2011) and referred to as "pervasive", meaning the factors influence all 
the variables. We provide a different intuition based on random matrix 
theory which suggests that (A3) is indeed mild and expected to be 
satisfied by a large class of loadings. 

If the elements of the $p_n \times k_n$ matrix $\Lambda_{0n}$ are drawn i.i.d. from a N(0, 1)
distribution, then Theorem 5.39 of Vershynin (2010) tells us that

$\left\| \frac{1}{p_n} \Lambda_{0n}^{\mathrm{T}} \Lambda_{0n} - I_{k_n} \right\|_2 \le C \sqrt{k_n / p_n}$

with probability at least $1 - e^{-C' k_n}$ for some constants $C', C > 0$.
Equivalently, all singular values of $\Lambda_{0n}/\sqrt{p_n}$ lie in
$(1 - C\sqrt{k_n/p_n},\ 1 + C\sqrt{k_n/p_n})$
with high probability. Intuitively, this tells us that "tall and skinny"
matrices when appropriately normalized behave as approximate isometries.

As our emphasis is on sparse factor models, a more realistic generative
model for the loadings would be

$\lambda_{0jh} \sim (1 - \pi_n)\,\delta_0 + \pi_n\,\mathrm{N}(0, 1),$

where $\lambda_{0jh} = [\Lambda_{0n}]_{jh}$, $\delta_0$ denotes a point mass at zero and

III 

to reflect the sparsity assumption in (A2). 
A modification of Theorem 5.39 of Vershynin (2010) implies 

Pn 

which in turn yields that 

A-On^-On - h, 

with probability at least $1 - e^{-C' k_n}$. Since $s_n = O(\log p_n)$ by (A2), we
let the normalizer $c_n$ in (A3) be $O(\log p_n)$.
• (A4) simply posits that the residual variance is bounded above and
below. The lower bound is used to prevent $\Sigma_{0n}$ from being ill-conditioned.
See Remark 3.7 for a discussion on relaxing this assumption.
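
The random matrix intuition behind (A3), namely that a normalized "tall and skinny" Gaussian matrix is close to an isometry, can be checked by simulation. The following is a minimal sketch, assuming NumPy and dense i.i.d. N(0, 1) loadings; the sizes and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 20000, 5                       # tall and skinny: p much larger than k
Lam = rng.normal(size=(p, k))         # i.i.d. N(0, 1) loadings

# Singular values of the normalized matrix Lam / sqrt(p).
sv = np.linalg.svd(Lam / np.sqrt(p), compute_uv=False)

# Operator-norm deviation of the normalized Gram matrix from the identity.
gram_dev = np.linalg.norm(Lam.T @ Lam / p - np.eye(k), ord=2)

scale = np.sqrt(k / p)                # the sqrt(k_n / p_n) scale from the text
print(sv, gram_dev, scale)
```

In this configuration all singular values fall within a few multiples of $\sqrt{k/p}$ of one, in line with the bound quoted from Vershynin (2010).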

We now define our prior $\Pi_n(\Lambda \otimes \sigma^2)$ on $\Theta^{(p_n,k_n)} \times \mathbb{R}^+$ through independent
priors on the loadings $\Lambda$ and the residual variance $\sigma^2$. We draw $\sigma^2$ from a
density $f_\sigma$ on $(0, \infty)$,

(PR)  $\sigma^2 \sim f_\sigma(\cdot)$.

We first consider a class of point mass mixture priors on the loadings
similar to that advocated by West (2003),

(PL1)  $\lambda_{jh} \sim (1 - \pi)\,\delta_0 + \pi\, g(\cdot), \quad j = 1, \ldots, p_n, \ h = 1, \ldots, k_n,$
       $\pi \sim \mathrm{Beta}(1, \kappa p_n + 1), \quad \kappa > 0,$

where $\delta_0$ denotes a point mass at zero and $g$ is an absolutely continuous
density on $\mathbb{R}$ with exponential tails or heavier.
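
A draw from a prior of the form (PL1) can be sketched as follows. This is only an illustration, assuming NumPy; the slab density $g$ is taken to be a Laplace density (one choice with exponential tails), and the function name `sample_pl1` and the parameter `slab_scale` are hypothetical.

```python
import numpy as np

def sample_pl1(p, k, kappa, rng, slab_scale=1.0):
    """One draw of a p x k loadings matrix from a point mass mixture prior.

    pi ~ Beta(1, kappa * p + 1); each entry is 0 with probability 1 - pi and is
    otherwise drawn from the slab g, assumed here to be Laplace(0, slab_scale).
    """
    pi = rng.beta(1.0, kappa * p + 1.0)
    nonzero = rng.random((p, k)) < pi
    slab = rng.laplace(0.0, slab_scale, size=(p, k))
    return np.where(nonzero, slab, 0.0), pi

rng = np.random.default_rng(2)
Lam, pi = sample_pl1(p=2000, k=3, kappa=1.0, rng=rng)
n_nonzero = int(np.count_nonzero(Lam))
print(pi, n_nonzero)
```

Because the Beta(1, $\kappa p_n$ + 1) hyperprior puts most of its mass on values of $\pi$ of order $1/p_n$, a typical draw has only a handful of non-zero loadings.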

In the context of linear regression, Scott and Berger (2010) showed that 
such point mass mixture priors with a beta hyper-prior on the mixture prob- 
ability lead to an automatic multiplicity correction. Jiang (2007) proved op- 
timality results in estimating the predictive under such priors in generalized 
linear models accommodating diverging numbers of predictors. Castillo and van der Vaart 
(2012) studied concentration properties of a class of prior distributions sim- 
ilar to (PL1) on a high-dimensional normal mean and showed that they lead 
to the minimax optimal rate of convergence. 

With the prior specification complete, we are in a position to state the 
first theorem on posterior contraction rates in the operator norm. 



Theorem 3.5. Suppose the true covariance matrix $\Sigma_{0n} \in \mathcal{C}_n$ satisfies
Assumptions (A0) - (A4) in Assumption 3.4 with $\gamma > 5$ in (A1). Also
assume independent priors $\Pi(\Lambda)$ and $\Pi(\sigma^2)$ on the loadings and the residual
variances as in (PL1) and (PR) respectively with $f_\sigma$ being a gamma(a, b)
density. Then with $\epsilon_n = \sqrt{(\log p_n)^5 / n}$,

$\lim_{n \to \infty} E_{\Sigma_{0n}} \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_2 > M\epsilon_n \mid y^{(n)}) = 0.$

As mentioned in the Introduction, although point mass mixture priors
are conceptually appealing in allowing exact sparsity and often lead to
attractive theoretical properties, posterior computation under such priors is
extremely daunting in high-dimensional cases. As an alternative,
a rich variety of continuous shrinkage priors have been developed that
admit a scale mixture representation (Polson and Scott, 2010). A fundamental
hurdle in studying theoretical properties of such priors is the difficulty
of obtaining tight bounds on their concentration. With the motivation of 
developing a continuous shrinkage prior that can be shown to concentrate 
on sparse vectors and approximate point mass mixture priors, we propose 
a novel class of priors. We use such priors for the factor loadings, but they 
should be broadly applicable in other high-dimensional settings. 

Let DE($\psi$) denote the Laplace or double-exponential distribution with scale
parameter $\psi$, with density given by

(3.3)  $f(x) = \frac{1}{2\psi}\, e^{-|x|/\psi}, \quad x \in \mathbb{R}.$

Draw the elements of a high-dimensional vector $\theta \in \mathbb{R}^p$ through the
following hierarchical mechanism:

(PS)  $\theta_j \sim \mathrm{DE}(\tau\gamma_j), \quad \tau \sim f_\tau, \quad \gamma \sim f_\gamma,$

where $f_\tau$ and $f_\gamma$ are densities on $\mathbb{R}^+$ and $\Delta_0^{p-1}$ respectively. In particular, we
require $f_\tau$ to satisfy (a) $P(\tau > \log p) \le e^{-c\log p}$, (b) $P(\tau \in [2\log p, 4\log p]) \ge e^{-C\log p}$
and (c) $P(\tau \le 1/\log p) \le e^{-c\log p}$ for large values of $p$. In Lemma
A.1 stated in the Appendix, we show that the IG($\log p$, $\log p$) distribution is
one possible candidate for $f_\tau$. We also choose $f_\gamma$ to be a Dir($\alpha/p, \ldots, \alpha/p$)
density, where Dir($a_1, \ldots, a_p$) denotes a Dirichlet distribution with parameters
$a_1, \ldots, a_p$, which has a density $f_\gamma$ on $\Delta_0^{p-1}$ given by

$f_\gamma(\gamma) = \frac{\Gamma(\sum_{j=1}^{p} a_j)}{\prod_{j=1}^{p} \Gamma(a_j)} \prod_{j=1}^{p-1} \gamma_j^{a_j - 1} \Big(1 - \sum_{j=1}^{p-1} \gamma_j\Big)^{a_p - 1}.$
Although the prior specification in (PS) has similarities to local-global
shrinkage rules in Polson and Scott (2010), a main difference is that the
local scale parameters in $\gamma$ are drawn jointly from a Dirichlet distribution
instead of independent draws from a continuous distribution on $\mathbb{R}^+$. In
particular, constraining the local scale parameters to lie on the simplex
prevents the $\ell_1$ norm of $\theta$ from blowing up with increasing dimension, while a
Dir($\alpha/p, \ldots, \alpha/p$) prior on $\gamma$ ensures that a handful of entries are left
unshrunk with the rest heavily shrunk towards zero. For a detailed discussion
on our proposed prior and connections to point mass mixture priors, refer
to Section 4.
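
The qualitative behaviour described above, a few unshrunk coordinates with the remainder close to zero, can be seen in simulated draws from (PS). The sketch below assumes NumPy, takes $f_\tau$ to be the IG($\log p$, $\log p$) candidate named in the text and $f_\gamma$ to be Dir($\alpha/p, \ldots, \alpha/p$); the function name `sample_ps`, the values $p = 200$, $\alpha = 1$ and the "top 10" summary are illustrative choices.

```python
import numpy as np

def sample_ps(p, alpha, rng):
    """One draw of theta from the hierarchical shrinkage prior (PS)."""
    a = np.log(p)
    # If X ~ Gamma(shape=a, rate=a) then 1/X ~ IG(a, a); NumPy uses scale = 1/rate.
    tau = 1.0 / rng.gamma(shape=a, scale=1.0 / a)
    gamma = rng.dirichlet(np.full(p, alpha / p))
    # theta_j ~ DE(tau * gamma_j): a standard Laplace draw times the scale.
    theta = rng.laplace(0.0, 1.0, size=p) * (tau * gamma)
    return theta, tau, gamma

rng = np.random.default_rng(3)
shares = []
for _ in range(200):
    theta, tau, gamma = sample_ps(200, 1.0, rng)
    mag = np.sort(np.abs(theta))[::-1]
    shares.append(mag[:10].sum() / mag.sum())   # l1 share of the 10 largest entries
mean_share = float(np.mean(shares))
print(mean_share)
```

On average the ten largest coordinates carry most of the $\ell_1$ norm of a draw, which is the sense in which (PS) mimics a point mass mixture prior without placing exact zeros.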

We show in Theorem 3.6 that our proposed shrinkage prior on the vector- 
ized loadings indeed works as a surrogate to the point mass mixture priors 
because they achieve the same posterior rate of convergence (up to a log 
factor) as in Theorem 3.5. 

Theorem 3.6. Suppose the true covariance matrix $\Sigma_{0n} \in \mathcal{C}_n$ satisfies
(A0) - (A4) in Assumption 3.4 with $\gamma > 5$ in (A1). Furthermore, suppose
that the vectorized loadings are drawn according to the shrinkage prior in
(PS) and the prior on $\sigma^2$ is as in (PR) with $f_\sigma$ being a gamma(a, b) density.

Then, with e n = ^fLEiiZ 

$\lim_{n \to \infty} E_{\Sigma_{0n}} \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_2 > M\epsilon_n \mid y^{(n)}) = 0.$

The following remark clarifies the lower bound assumption on $\sigma$ in (A4).

Remark 3.7. (A4) assumes the lower bound to be fixed rather than
decaying with $p_n$ for technical simplicity. We claim without proof that one
can actually let the lower bound equal $C/(\log p_n)^{1/4}$, incurring only minor changes in the
proofs for Theorems 3.5 and 3.6. In that case the rates of convergence are

slowed down by a (logn) term, i.e., e n = \J ^ log ^"^ - (log n) K for some constant 

K>0. 

4. Shrinkage prior in high-dimensional settings. Let $\theta$ be a $p$-dimensional
vector and $\theta_0 \in \ell_0[s;p]$ be an $s$-sparse vector with $s = O(\log p)$.
Depending on the problem, $\theta$ might correspond to a high-dimensional mean
vector, a vector of regression coefficients or a column of the factor loadings,
with $\theta_0$ corresponding to a sparse truth. A quantity of fundamental
importance in studying the behavior of the posterior distribution in these
high-dimensional problems is the prior concentration or the non-centered
small ball probability

(4.1)  $P(\|\theta - \theta_0\|_2 < \epsilon)$

around sparse vectors $\theta_0$. It can be shown that if the $\theta_j$'s are i.i.d. standard
normal,

$\sup_{\theta_0 \in \ell_0[s;p]} P(\|\theta - \theta_0\|_2 < \epsilon) \le e^{-Cp\log(1/\epsilon)},$

which decays exponentially with $p$ for fixed $s$, limiting the ability of the
posterior to concentrate on sparse $\theta_0$.
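
The decay with $p$ can be made concrete in the centered case $\theta_0 = 0$, where $P(\|\theta\|_2 < \epsilon) = P(\chi^2_p < \epsilon^2)$ under i.i.d. standard normal coordinates. The sketch below uses only the Python standard library and evaluates the log-probability through the power series of the regularized lower incomplete gamma function; the function name and the choice $\epsilon = 0.5$ are illustrative.

```python
import math

def log_chi2_cdf(x, p, terms=200):
    """log P(chi^2_p <= x) via the series
    P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_k z^k / ((a + 1) ... (a + k)),
    with a = p / 2 and z = x / 2 (accurate when z is small relative to a)."""
    a, z = p / 2.0, x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, terms):
        term *= z / (a + k)
        total += term
    return a * math.log(z) - z - math.lgamma(a + 1.0) + math.log(total)

eps = 0.5
log_probs = {p: log_chi2_cdf(eps ** 2, p) for p in (10, 100, 1000)}
for p, lp in log_probs.items():
    print(p, lp)
```

The log-probability drops far faster than linearly in $p$, whereas a prior that concentrates on sparse vectors only needs to pay a price governed by the sparsity level $s$.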

However, with appropriate point mass mixture priors, the small ball prob- 
ability (4.1) can be improved to e~ Csl ° s ^\ We thus discuss some of the 
salient features of point mass mixture priors here and illustrate how these 
features can give insights for developing continuous shrinkage priors. 

Castillo and van der Vaart (2012) recommended the following hierarchical
prior on $\theta$:

(P1) An integer $J$ is chosen according to a prior probability $\pi_p$ on $\{1, \ldots, p\}$.
(P2) A subset $S$ of size $J$ is chosen uniformly at random from the $\binom{p}{J}$ subsets
of size $J$.
(P3) Given $(J, S)$, elements of $\theta_S$ are drawn independently from a probability
distribution with Lebesgue density $g$ on $\mathbb{R}$ and this is extended to
$\theta \in \mathbb{R}^p$ by setting the remaining coordinates to 0.

The commonly used point mass mixture priors of the form $\theta_j \sim (1 - \pi)\delta_0 + \pi g$
arise as a special case of the above general framework with the prior
$\pi_p$ on the subset size corresponding to the Binomial($p$, $\pi$) prior.
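
The three steps (P1) - (P3) can be sketched directly. The code below assumes NumPy, uses the Binomial($p$, $\pi$) prior on the subset size (which also allows size zero) and takes $g$ to be a standard Laplace density purely for illustration; the function name is hypothetical and indices are 0-based.

```python
import numpy as np

def sample_subset_prior(p, pi, rng):
    """One draw of theta following steps (P1) - (P3) with a Binomial(p, pi) size prior."""
    J = rng.binomial(p, pi)                      # (P1) subset size
    S = rng.choice(p, size=J, replace=False)     # (P2) uniform subset of size J
    theta = np.zeros(p)
    theta[S] = rng.laplace(size=J)               # (P3) g assumed to be Laplace(0, 1)
    return theta, S

rng = np.random.default_rng(4)
p, pi = 1000, 0.01
sizes = []
for _ in range(2000):
    theta, S = sample_subset_prior(p, pi, rng)
    sizes.append(np.count_nonzero(theta))
mean_size = float(np.mean(sizes))
print(mean_size)
```

The average support size is close to $p\pi$, matching what independent draws $\theta_j \sim (1 - \pi)\delta_0 + \pi g$ would give.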

Now suppose $\theta_0 \in \ell_0[s;p]$ and let $S_0$ denote the support of $\theta_0$. Then,

(4.2)  $P(\|\theta - \theta_0\|_2 < \epsilon) \ge \Pi(S_0)\, P(\|\theta_{S_0} - \theta_{0S_0}\|_2 < \epsilon),$

where $\Pi(S_0)$ denotes the prior probability of choosing the subset $S_0$. In
particular, under the Binomial($p$, $\pi$) prior on the subset size, we have

$\Pi(S_0) = \pi^s (1 - \pi)^{p-s}.$

If one knew $s$ beforehand, an intuitive choice for $\pi$ is $s/p$, with the
corresponding prior referred to as the oracle prior by Castillo and van der Vaart
(2012). With this choice,
and

$P(\|\theta_{S_0} - \theta_{0S_0}\|_2 < \epsilon) \ge e^{-Cs\log(1/\epsilon)},$

leading to a higher prior concentration. Castillo and van der Vaart (2012)
further showed that even without knowledge of $s$, one can achieve similar
concentration around sparse vectors through a Beta hyperprior $\pi \sim$
Beta($1, \kappa p + 1$), $\kappa > 0$.
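
The size of $\Pi(S_0)$ under the oracle choice $\pi = s/p$ can be examined numerically. The sketch below uses only the Python standard library and compares $\log \Pi(S_0) = s\log(s/p) + (p - s)\log(1 - s/p)$ with the elementary lower bound $-s\log(p/s) - s\log(2e)$, which follows from $(1 - x)^{1/x} \ge 1/(2e)$ for $0 < x \le 1/2$. The comparison is an illustration under the Binomial formula for $\Pi(S_0)$ and is not a display taken from the paper; the function names are hypothetical.

```python
import math

def log_prior_mass_support(s, p):
    """log of pi^s (1 - pi)^(p - s) at the oracle choice pi = s / p."""
    pi = s / p
    return s * math.log(pi) + (p - s) * math.log1p(-pi)

def log_lower_bound(s, p):
    """-s log(p / s) - s log(2e), valid when s / p <= 1/2."""
    return -s * math.log(p / s) - s * math.log(2.0 * math.e)

cases = [(1, 10), (5, 1000), (20, 100000), (50, 100)]
results = [(s, p, log_prior_mass_support(s, p), log_lower_bound(s, p)) for s, p in cases]
for row in results:
    print(row)
```

In each case the log prior mass is of order $-s\log(p/s)$, so the cost of selecting the correct support scales with $s$ rather than with $p$.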

To avoid the computational difficulties associated with point mass mixture
priors, a number of recent works aim to develop a continuous shrinkage prior
that effectively mimics the mixture priors. Polson and Scott (2010) unified
a number of such priors through the following scale-mixture representation:



where $\psi_j$ and $\phi$ are local and global scale parameters, respectively.
Despite the computational advantages of this family of shrinkage priors, it is not
clear whether they have adequate concentration around $s$-sparse vectors. We
found that a suitable dependence structure in $(\psi_1, \ldots, \psi_p)$ can force a large
subset of the local scales $\psi_j$ to be simultaneously close to zero and thus
achieve a concentration similar to point mass mixture priors. This observation
motivated the shrinkage prior (PS) in Section 3, where we let $\psi_j = \tau\gamma_j$
with $\tau > 0$ and $\gamma = (\gamma_1, \ldots, \gamma_p)^{\mathrm{T}} \in \Delta^{p-1}$ with $\gamma \sim \mathrm{Dir}(\alpha/p, \ldots, \alpha/p)$.

We now exhibit some aspects of our proposed shrinkage prior (PS). We
shall first show that (PS) achieves the same concentration around sparse vectors
as the point mass mixture priors in Castillo and van der Vaart (2012).
We further exhibit a tail bound on the number of "large signals" implied by
(PS) and conclude the section by proving a large deviation result for the $\ell_1$
norm of a vector drawn from (PS).

In the following Lemma 4.1, we show that under a mild restriction on the
magnitude of the non-zero entries of $\theta_0$, the hierarchical prior specification
in (PS) leads to the same order of concentration around elements in $\ell_0[s;p]$
as (P1) - (P3).

Lemma 4.1. Suppose $\theta$ is drawn according to the prior (PS). Let $\theta_0 \in
\ell_0[s;p]$, $1 < s < p$, with $\|\theta_0\|_1 = O(s\log s)$ and $s/p < 1/2$. Then, for any
$\epsilon \in (0,1)$,

(4.3)  $P(\|\theta - \theta_0\|_2 < \epsilon) \ge \exp[-C\max\{s\log(s/\epsilon), \log p\}]$

for some constant $C > 0$.
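Proof. Let $\delta = \epsilon/p$. To lower-bound $P(\|\theta - \theta_0\|_2 < \epsilon)$, we first obtain a
lower bound conditioned on the hyperparameters $\tau$ and $\gamma$:

$P(\|\theta - \theta_0\|_2 < \epsilon \mid \tau, \gamma)$
       $\ge P(|\theta_j| < \delta \ \forall j \in S_0^c \mid \tau, \gamma)\; P(\|\theta_{S_0} - \theta_{0S_0}\|_2 < \epsilon/2 \mid \tau, \gamma)$
(4.4)  $= \prod_{j \in S_0^c} \{1 - e^{-\delta/(\tau\gamma_j)}\} \times P(\|\theta_{S_0} - \theta_{0S_0}\|_2 < \epsilon/2 \mid \tau, \gamma).$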



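Let $\gamma = (\gamma_1, \ldots, \gamma_{p-1})^{\mathrm{T}}$ and $\gamma_p = 1 - \sum_{j=1}^{p-1} \gamma_j$. We now have to integrate
out $\tau$ and $\gamma$ in (4.4). By a relabeling of indices, we can always make sure that
the $p$th index lies in $S_0$. Let $S_1 = S_0 \setminus \{p\}$ so that $S_0^c \cup S_1 = \{1, \ldots, p - 1\}$.
For a fixed $\tau$ in the interval $[2s, 4s]$ and numbers $a, b \in (0, 1)$ with $b = 4a$,
let $A_\tau$ denote the subset of $\Delta_0^{p-1}$ given by

(4.5)  $A_\tau = \Big\{ 0 < \gamma_j \le \frac{\delta}{\log(p/s)\,\tau} \ \ \forall j \in S_0^c; \quad \gamma_j \in \Big[\frac{a}{\tau}, \frac{b}{\tau}\Big] \ \ \forall j \in S_1 \Big\}.$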

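Observe that $A_\tau$ defines a valid subset of $\Delta_0^{p-1}$ for $\epsilon$ small enough, since
$\gamma_j \ge 0$ for all $j = 1, \ldots, p - 1$ and

(4.6)  $\sum_{j=1}^{p-1} \gamma_j = \sum_{j \in S_0^c} \gamma_j + \sum_{j \in S_1} \gamma_j \le \frac{(p-s)\,\delta}{\log(p/s)\,\tau} + \frac{(s-1)\,b}{\tau} \le b < 1$

for $\epsilon \le b/2$. Thus,

$P(\|\theta - \theta_0\|_2 < \epsilon) = \int_{(\tau,\gamma) \in \mathbb{R}^+ \times \Delta_0^{p-1}} P(\|\theta - \theta_0\|_2 < \epsilon \mid \tau, \gamma)\, f_\gamma(d\gamma)\, f_\tau(d\tau)$
(4.7)  $\ge \int_{(\tau,\gamma) \in B} P(\|\theta - \theta_0\|_2 < \epsilon \mid \tau, \gamma)\, f_\gamma(d\gamma)\, f_\tau(d\tau),$

where $B = \bigcup_{\tau \in [2s,4s]} B_\tau$ with $B_\tau = \{\tau\} \times A_\tau \subset \mathbb{R}^+ \times \Delta_0^{p-1}$. We now substitute
the lower bound for $P(\|\theta - \theta_0\|_2 < \epsilon \mid \tau, \gamma)$ from (4.4) in (4.7) and lower-bound
the two terms on the right hand side of (4.4) individually.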
For the first term, observe that for $(\tau, \gamma) \in B$,

$\prod_{j \in S_0^c} \{1 - e^{-\delta/(\tau\gamma_j)}\} \ge (1 - s/p)^{p-s}.$

To tackle the second term, we make use of the following Lemma 4.2 whose
proof is provided in the Appendix.
Lemma 4.2. Let $\eta \in \mathbb{R}^s$ denote a random vector with independent components
$\eta_j \sim \mathrm{DE}(\psi_j)$. If there exist numbers $a, b > 0$ such that $\psi_j \in [a, b]$
for all $j = 1, \ldots, s$, then for any $\delta > 0$ and $\eta_0 \in \mathbb{R}^s$,

$P(\|\eta - \eta_0\|_2 < \delta) \ge \exp\{-s\log 2 - \|\eta_0\|_1/a\}\, \big(1 - e^{-\delta/(b\sqrt{s})}\big)^s.$

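By definition, $\psi_j \in [a, b]$ for all $j \in S_1$ whenever $(\tau, \gamma) \in \mathcal{B}$. Further,
along the lines of (4.6), $\sum_{j=1}^{p-1} \gamma_j \in [a/8, b]$ and hence $\gamma_p \in [1 - b, 1 - a/8]$ on
$\mathcal{B}$. Since $a, b$ are constants, by a slight abuse of notation, we shall assume
$\psi_j \in [a, b]$ for all $j \in S_0$ on $\mathcal{B}$. It thus follows from Lemma 4.2 that

\[ P(\|\Pi_{S_0}(\theta) - \Pi_{S_0}(\theta_0)\|_2 < \epsilon/2 \mid \tau, \gamma) \ge \exp\{-s \log 2 - \|\theta_0\|_1/a\}\,\{1 - e^{-\epsilon/(2b\sqrt{s})}\}^s. \]

Since $1 - \exp(-x) \ge x/2$ for all $x \in [0,1]$, for $\epsilon$ small enough so that
$\epsilon/(2b\sqrt{s}) < 1$, we conclude that for $(\tau, \gamma) \in \mathcal{B}$, the integrand in (4.7) can be
bounded below as follows:

\begin{align*}
P(\|\theta - \theta_0\|_2 < \epsilon \mid \tau, \gamma)
&\ge (1 - s/p)^{p-s} \exp\Big\{-s \log 2 - \frac{\|\theta_0\|_1}{a} + s \log \frac{\epsilon}{4b\sqrt{s}}\Big\} \\
&\ge e^{-Cs} \exp\Big\{-s \log 2 - \frac{\|\theta_0\|_1}{a} + s \log \frac{\epsilon}{4b\sqrt{s}}\Big\}, \tag{4.8}
\end{align*}

where the last inequality uses $(1 - x)^{1/x} \ge 1/(2e)$ for $0 < x < 1/2$ and
$C = \log(2e)$. It thus remains to obtain a lower bound to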



\[ P(\mathcal{B}) = \int_{\mathcal{B}} f_\gamma(d\gamma)\, f_\tau(d\tau) = \int_{\tau = 2s}^{4s} P(\mathcal{A}_\tau \mid \tau)\, f_\tau(d\tau). \tag{4.9} \]

Now, since $\gamma \sim \mathrm{Dir}(\alpha/p, \ldots, \alpha/p)$, recalling the definition of $\mathcal{A}_\tau$ from (4.5)
and using (4.6),

. , r(a) 



T(a/p)P J^ £At 



p-1 

lb 



a/p—l 



P-1 \a/p-l 

i=i y 



/ 


' n 


X 


- n 7 « /P -r 


071 .. . c?7p 











(4.10) 

where 
(4.11) 

(4.12) 



> C p (l - b) 



a/p—l 



log(p/s) 



alp / \ a/p \ s— 1 



a 



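\begin{align*}
C_p = \frac{\Gamma(\alpha)}{\Gamma(\alpha/p)^p} \Big(\frac{p}{\alpha}\Big)^{p-1}
&= \exp\{\log \Gamma(\alpha) + (p - 1) \log(p/\alpha) - p \log \Gamma(\alpha/p)\} \\
&\ge \exp\{\log \Gamma(\alpha) - \log \Gamma(\alpha/p)\} \\
&\ge \exp\{\log \Gamma(\alpha) - \log(p/\alpha)\},
\end{align*}

with the last two inequalities using $\Gamma(x) \le 1/x$ for all $x \in (0, 1)$. Moreover,
since $b \ge 4a$, we have for $\tau \in [2s, 4s]$,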



lj\ a /p / a \ a /P} s_ i 



> 



(4.13) 



> 



\ (s-l)a/p 

Ts 



1 — exp 



n 



log(26/a) 



a/p \ s— 1 



Equations (4.12) and (4.13), in conjunction with the fact that $1 - e^{-x} \ge x/2$
for $x \in (0,1)$, imply that the expression in (4.10), and thus $P(\mathcal{A}_\tau \mid \tau)$ in
(4.9), is bounded below by



(4.14) 

P(.At I r) > Cexp 



( «(p ~ f) 1q g 

\ p log(p/s) & a log(6/2a) & a 



P 

log- 



P 

log- 



for some constant $C > 0$. Finally, (4.8) and (4.14) substituted into (4.7)
give us

\[ P(\|\theta - \theta_0\|_2 < \epsilon) \ge P[\tau \in (2s, 4s)]\, e^{-C \max\{s \log(s/\epsilon),\, \log p\}}. \]

The proof of Lemma 4.1 is completed upon observing that $P[\tau \in (2s, 4s)] \ge
e^{-C \log p}$ by definition. □
We would next like to show that the shrinkage prior in (PS) does not
spread its mass across too many dimensions. A point mass mixture prior
allows a high-dimensional vector to collapse onto fewer dimensions. Hence,
the implied dimensionality can be naturally studied through appropriate
tail bounds for the induced prior on $|\mathrm{supp}(\theta)|$, which is a random variable
supported on $\{0, 1, \ldots, p\}$. Such bounds on the prior dimensionality lead to
better control of the metric entropy and enable construction of sieves; see
Castillo and van der Vaart (2012). However, continuous shrinkage priors do
not allow exact zeroes in $\theta$ and clearly $P(|\mathrm{supp}(\theta)| = p) = 1$. Recalling the
intuition that (PS) shrinks a large subset of the entries in $\theta$ close to zero
while allowing a few large signals, we devise a generalized definition of the
support of a vector as the subset of entries which are larger than a small
number $\delta$ in magnitude. For any $\delta > 0$, we denote the corresponding subset
by $\mathrm{supp}_\delta(\theta)$, so that

\[ \mathrm{supp}_\delta(\theta) = \{j : |\theta_j| > \delta\}. \]

In the following Lemma 4.3, we provide a tail bound for $|\mathrm{supp}_\delta(\theta)|$ that is
crucially used later in Section 6.

Lemma 4.3. Let $\epsilon \in (0,1)$ and $\delta = \epsilon/p$. If $\theta$ is drawn according to the
prior (PS), then there exists a constant $A > 0$ such that

\[ P(|\mathrm{supp}_\delta(\theta)| > A \log p) \le e^{-C \log p} \]

for some constant $C > 0$.

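Proof. Let $s = \log p$. Clearly, for any $A > 0$,

\[ P(|\mathrm{supp}_\delta(\theta)| > As) = \sum_{j = As + 1}^{p} \int_0^\infty \int P(|\mathrm{supp}_\delta(\theta)| = j \mid \tau, \gamma)\, f_\gamma(d\gamma)\, f_\tau(d\tau). \tag{4.15} \]

Observe that

\[ P(|\mathrm{supp}_\delta(\theta)| = j \mid \tau, \gamma) = \sum_{S : |S| = j}\ \prod_{k \in S} \pi_k \prod_{k \in S^c} (1 - \pi_k), \]

where $\pi_j = P(|\theta_j| > \delta \mid \gamma, \tau) = e^{-\delta/(\tau \gamma_j)}$ by the prior specification in (PS).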
We can clearly restrict our attention to {r > To} with 

log(p/s) 
since by definition of $f_\tau$,

\[ P[\tau < \tau_0] \le \exp\{-C \log p\}. \tag{4.17} \]
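The conditional law of $|\mathrm{supp}_\delta(\theta)|$ given $(\tau, \gamma)$ used in the proof is that of a sum of independent Bernoulli($\pi_j$) indicators. The sketch below computes that law by dynamic programming and cross-checks it against the subset-sum formula by brute-force enumeration. It assumes $\pi_j = e^{-\delta/(\tau\gamma_j)}$ (the tail of a Laplace law with scale $\tau\gamma_j$); the values of `delta`, `tau` and `gamma` are hypothetical and chosen small so that enumeration stays cheap.

```python
import itertools
import math

def support_size_pmf(pi):
    """pmf of the number of successes among independent Bernoulli(pi_k) trials (dynamic programming)."""
    pmf = [1.0]
    for q in pi:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - q)   # coordinate k stays below the threshold
            updated[k + 1] += mass * q       # coordinate k exceeds the threshold
        pmf = updated
    return pmf

def support_size_pmf_bruteforce(pi):
    """Same pmf via the sum over subsets S of prod_{k in S} pi_k * prod_{k not in S} (1 - pi_k)."""
    pmf = [0.0] * (len(pi) + 1)
    for indicators in itertools.product([0, 1], repeat=len(pi)):
        weight = 1.0
        for flag, q in zip(indicators, pi):
            weight *= q if flag else (1.0 - q)
        pmf[sum(indicators)] += weight
    return pmf

# Hypothetical values for illustration only.
delta, tau = 0.01, 2.0
gamma = [0.4, 0.3, 0.15, 0.1, 0.05]
pi = [math.exp(-delta / (tau * g)) for g in gamma]

pmf_dp = support_size_pmf(pi)
pmf_bf = support_size_pmf_bruteforce(pi)
print([round(v, 6) for v in pmf_dp])
```

The dynamic program costs $O(p^2)$ operations, whereas the subset formula has $2^p$ terms; the two agree, which is why tail bounds for this law can be taken from results on sums of independent indicators.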

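For a fixed $\tau \ge \tau_0$, consider $\mathcal{E}_\tau \subset \Delta_0^{p-1}$ with

\[ \mathcal{E}_\tau = \Big\{ \gamma : \gamma_j \in \Big[ \frac{\delta}{\tau \log(2p/s)},\ \frac{\delta}{\tau \log(p/s)} \Big],\ j = 1, \ldots, p-1 \Big\}. \]

Clearly, $\mathcal{E}_\tau$ defines a valid subset of $\Delta_0^{p-1}$ for any $\tau \ge \tau_0$ since
$\sum_{j=1}^{p-1} \gamma_j \le (p-1)\delta/\{\tau \log(p/s)\} < 1$ by (4.16). Moreover, on $\mathcal{E}_\tau$,
$\pi_j \in [s/(2p), s/p]$ for all $j = 1, \ldots, p-1$ and thus it follows from Lemma 4.2 of
Castillo and van der Vaart (2012) that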

p—s 

(4.18) J2 P d su PP^)l =3 K,7) < e~ Cs 

since $(1 - x)^{1/x} \le 1/e$ for all $x \in (0, 1)$. The proof of Lemma 4.3 will be
completed if we can show that

\[ P(\mathcal{E}_\tau^c \mid \tau) \le e^{-C \log p} \tag{4.19} \]

for any $\tau \ge \tau_0$. To that end, proceeding along the lines of the calculations
in (4.10),



(4.20) < C p \ 1 - ( -f^^^ ) 

(4.21) x i 1 



T\og{p/s)J \r\og{2p/s) 
(p-l)5 



rlog(2p/s) 

with $C_p$ as in (4.11). To obtain an upper bound for $C_p$, we study the function
$g(x) = \frac{1}{x} \log\big(\frac{1}{x \Gamma(x)}\big)$ near zero in the following Lemma 4.4; a proof can be
found in the Appendix.

Lemma 4.4. The function $g(x) = \frac{1}{x} \log\big(\frac{1}{x \Gamma(x)}\big)$ is monotonically decreasing
on $(0, 1/2)$ with $\lim_{x \to 0} g(x) = \gamma_0$, where $\gamma_0 = -\Gamma'(1)$ is the Euler constant.

Letting $x = \alpha/p$ in Lemma 4.4, $p \log(p/\alpha) - p \log \Gamma(\alpha/p) = \alpha g(x) \le \alpha \gamma_0$.
Hence,

\[ C_p = \exp\{\log \Gamma(\alpha) + (p - 1) \log(p/\alpha) - p \log \Gamma(\alpha/p)\} \le C \exp\{-\log(p/\alpha)\} \]

for some constant $C > 0$. Note that Lemma 4.4 is indeed needed, since
the usual bound $\Gamma(x) \ge 1/(2x)$ on $(0,1)$ would lead to the less stringent bound
$C_p \le 2^p$.
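The behaviour of $g$ claimed in Lemma 4.4 can be checked numerically. The sketch below uses the identity $x\Gamma(x) = \Gamma(1 + x)$, so that $g(x) = -\log\Gamma(1+x)/x$, and evaluates it with the standard-library `math.lgamma`. It also checks the identity $p\log(p/\alpha) - p\log\Gamma(\alpha/p) = \alpha\, g(\alpha/p)$ for one illustrative pair (`alpha`, `p`); those two values are arbitrary choices, not taken from the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, equal to -Gamma'(1)

def g(x):
    """g(x) = (1/x) * log(1 / (x * Gamma(x))) = -lgamma(1 + x) / x."""
    return -math.lgamma(1.0 + x) / x

# Evaluate g on a grid inside (0, 1/2) and near zero.
grid = [0.01 * k for k in range(1, 50)]
values = [g(x) for x in grid]
near_zero = g(1e-6)

# Identity used right after the lemma, for illustrative alpha and p.
alpha, p = 1.0, 1000
lhs = p * math.log(p / alpha) - p * math.lgamma(alpha / p)
rhs = alpha * g(alpha / p)

print(f"g(1e-6) = {near_zero:.6f}, g(0.49) = {values[-1]:.6f}")
print(f"identity gap = {abs(lhs - rhs):.2e}")
```

Since $\log\Gamma(1 + x)$ is convex and vanishes at $x = 0$, the ratio $\log\Gamma(1+x)/x$ is increasing, which is one way to see why $g$ decreases.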

Now, for r > tq, 



< 



a/p / § \ a /P 



T\0g{p/s)J \T\og{2p/s) 

8 \<p-V/p\^ { bg(p/a) \ a/p ' 



Tlog(p/s) 



1 



log (p/s) + log 2 



/ 1 \ a(p-l)/p 
< ( — l —) < e -«/21og( P -l) < 1; 

implying the second term in (4.20) can be bounded by 1. Equation (4.19) is 
established upon observing that the term in (4.21) can be bounded above 
by a constant. 

Equations (4.17), (4.18) and (4.19) imply that each summand in (4.15) is
bounded above by $e^{-C \log p}$. Noting that there are $(p - As)$ such terms, the
proof of Lemma 4.3 follows by choosing $A$ suitably large.

□

A final important property of the proposed shrinkage prior is established
through the following large deviation result on the $\ell_1$ norm of $\theta$:

Lemma 4.5. We have $P[\|\theta\|_1 > (\log p)^2] \le e^{-C \log p}$.

Proof. Recall $\theta_j \sim \mathrm{DE}(\tau \gamma_j)$. Let $X_j = \theta_j/(\tau \gamma_j)$; clearly $X_j \sim \mathrm{DE}(1)$.
Let $\psi_j = \tau \gamma_j$ and fix $t > 0$. We now use a Bernstein-type tail inequality for
sub-exponential random variables (Proposition 5.16 of Vershynin (2010))
to conclude

\[ P\Big(\sum_{j=1}^{p} |\theta_j| > t \,\Big|\, \tau, \gamma\Big) = P\Big(\sum_{j=1}^{p} \psi_j |X_j| > t\Big) \le e^{-C \min\{t^2/\|\psi\|_2^2,\ t/\|\psi\|_\infty\}} \le \max\{e^{-C t^2/\tau^2},\ e^{-C t/\tau}\}. \]

The last inequality in the above display uses $\|\gamma\|_2 \le 1$ and the fact that
$e^{-c/x}$ is increasing with $x$. Thus, with $t = (\log p)^2$,

\[ P\{(\|\theta\|_1 > t) \cap (\tau \le \log p)\} \le e^{-C \log p}. \]

The proof is completed since $P(\tau > \log p) \le e^{-C \log p}$. □

5. Auxiliary results. In this section, we provide a number of auxiliary 
results that are used to prove the main results in Section 3 and are also of 
independent interest. 

5.1. Some matrix results. We begin with some matrix inequalities 
that are used throughout. 

Lemma 5.1. For any two matrices $A, B$,

(i) $s_{\min}(A)\, \|B\|_F \le \|AB\|_F \le \|A\|_2\, \|B\|_F$,

(ii) $s_{\min}(A)\, \|B\|_2 \le \|AB\|_2 \le \|A\|_2\, \|B\|_2$,

(iii) $s_{\min}(A)\, s_{\min}(B) \le s_{\min}(AB) \le \|A\|_2\, s_{\min}(B)$.
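The three inequalities of Lemma 5.1 are easy to probe numerically. The following sketch checks them for a pair of random square matrices using NumPy; square matrices are an assumption made here so that $s_{\min}$ is unambiguous, and the seed and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))

def s_min(M):
    """Smallest singular value."""
    return np.linalg.svd(M, compute_uv=False).min()

def op_norm(M):
    """Operator (spectral) norm, i.e. the largest singular value."""
    return np.linalg.norm(M, 2)

def fro_norm(M):
    """Frobenius norm."""
    return np.linalg.norm(M, "fro")

AB = A @ B
tol = 1e-10  # slack for floating-point rounding

checks = {
    "(i)": s_min(A) * fro_norm(B) <= fro_norm(AB) + tol
    and fro_norm(AB) <= op_norm(A) * fro_norm(B) + tol,
    "(ii)": s_min(A) * op_norm(B) <= op_norm(AB) + tol
    and op_norm(AB) <= op_norm(A) * op_norm(B) + tol,
    "(iii)": s_min(A) * s_min(B) <= s_min(AB) + tol
    and s_min(AB) <= op_norm(A) * s_min(B) + tol,
}
print(checks)
```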

The next lemma comes in handy when manipulating the log-likelihood ratio of
two multivariate normal densities.

Lemma 5.2. For $p \times p$ positive definite matrices $\Sigma, \Sigma'$,

(R1) $\mathrm{tr}\{(\Sigma' \Sigma^{-1} - I_p)^2\} = \|\Sigma^{-1/2} \Sigma' \Sigma^{-1/2} - I_p\|_F^2$;
(R2) $\log |\Sigma' \Sigma^{-1}| - \mathrm{tr}(\Sigma' \Sigma^{-1} - I_p) \le 0$.

Proof. To prove the identity in (R1), observe that by similarity, $\Sigma' \Sigma^{-1}$
and $\Sigma^{-1/2} \Sigma' \Sigma^{-1/2}$ have the same set of non-zero eigenvalues and thus

\[ \mathrm{tr}\{(\Sigma' \Sigma^{-1} - I_p)^2\} = \mathrm{tr}\{(\Sigma^{-1/2} \Sigma' \Sigma^{-1/2} - I_p)^2\}. \]

The proof is completed upon observing $\mathrm{tr}(A^2) = \|A\|_F^2$ for any symmetric
matrix $A$.

To prove (R2), let $\Sigma' = \Sigma + R$, so that $\Sigma' \Sigma^{-1} - I_p = R \Sigma^{-1}$. Since
$\Sigma^{-1/2} \Sigma' \Sigma^{-1/2}$ is positive definite, by the similarity argument in the
paragraph above, all eigenvalues of $\Sigma' \Sigma^{-1}$ are positive. Let us denote these
eigenvalues by $1 + \theta_j$, $j = 1, \ldots, p$, with $\theta_j > -1$. Thus,

\[ \log |\Sigma' \Sigma^{-1}| - \mathrm{tr}(\Sigma' \Sigma^{-1} - I_p) = \sum_{j=1}^{p} \{\log(1 + \theta_j) - \theta_j\} \le 0. \]

□
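Both parts of Lemma 5.2 can be verified on random inputs. The sketch below builds two positive definite matrices, forms the symmetric inverse square root through an eigendecomposition, and evaluates the two sides of the trace identity and the sign of the log-determinant expression. The helper `random_spd`, the dimension and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6

def random_spd(dim):
    """A random symmetric positive definite matrix (illustrative construction)."""
    G = rng.standard_normal((dim, 2 * dim))
    return G @ G.T / (2 * dim) + 0.1 * np.eye(dim)

Sigma, Sigma_prime = random_spd(p), random_spd(p)

# Symmetric inverse square root of Sigma via its eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T

M = Sigma_prime @ np.linalg.inv(Sigma) - np.eye(p)

# Trace identity: tr(M^2) versus the squared Frobenius norm of the symmetrized version.
r1_lhs = np.trace(M @ M)
r1_rhs = np.linalg.norm(Sigma_inv_half @ Sigma_prime @ Sigma_inv_half - np.eye(p), "fro") ** 2

# Log-determinant inequality: log|Sigma' Sigma^{-1}| - tr(Sigma' Sigma^{-1} - I) <= 0.
sign, logdet = np.linalg.slogdet(Sigma_prime @ np.linalg.inv(Sigma))
r2_value = logdet - np.trace(M)

print(f"tr(M^2) = {r1_lhs:.6f}, Frobenius form = {r1_rhs:.6f}, log-det expression = {r2_value:.6f}")
```

The log-determinant expression is, up to a factor, the negative Kullback-Leibler divergence between the two centered Gaussians, which is another way to see why it cannot be positive.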
The next lemma is adapted from Lemma 5.36 of Vershynin (2010).

Lemma 5.3. For a $p \times k$ matrix $B$ with $p \ge k$, suppose

\[ \|B^T B - I_k\|_2 \le \max\{\delta, \delta^2\} \]

for some $\delta > 0$. Then,

\[ 1 - \delta \le s_{\min}(B) \le s_{\max}(B) \le 1 + \delta. \]
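To illustrate Lemma 5.3, the sketch below draws a tall Gaussian matrix scaled by $1/\sqrt{p}$ (an illustrative example of an approximate isometry, not the loading matrix of the paper), measures $\|B^TB - I_k\|_2$, picks the smallest $\delta$ for which the hypothesis of the lemma holds, and checks that all singular values of $B$ lie in $[1 - \delta, 1 + \delta]$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 400, 5
B = rng.standard_normal((p, k)) / np.sqrt(p)   # columns are nearly orthonormal for large p

# Deviation of the Gram matrix from the identity, in operator norm.
err = np.linalg.norm(B.T @ B - np.eye(k), 2)

# Smallest delta with max(delta, delta^2) >= err.
delta = err if err <= 1.0 else np.sqrt(err)

singular_values = np.linalg.svd(B, compute_uv=False)
print(f"err = {err:.4f}, delta = {delta:.4f}")
print(f"s_min = {singular_values.min():.4f}, s_max = {singular_values.max():.4f}")
```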

5.2. A large deviation result for quadratic forms. Lemma 5.4 be- 
low provides an exponential tail bound for the sample average of symmetric 
quadratic forms around the population mean. 

Lemma 5.4. Let $\xi_1, \ldots, \xi_n \sim N_p(0, I_p)$ and $A$ be a $p \times p$ symmetric
matrix. Define $Q_i = \xi_i^T A \xi_i$. Then, for every $t > 0$,

\[ P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} Q_i - \mathrm{tr}(A) \Big| > t \Big) \le 2 \exp\Big[ -C \min\Big\{ \frac{n t^2}{K^2 \|A\|_F^2},\ \frac{n t}{K \|A\|_2} \Big\} \Big] \]

for some absolute constants $C, K > 0$.

Proof. Since $A$ is symmetric, all eigenvalues of $A$ are real. Let $A = VDV^T$
be an eigendecomposition of $A$, with $V$ a $p \times p$ orthogonal matrix and
$D = \mathrm{diag}(d_1, \ldots, d_p)$ a diagonal matrix of the eigenvalues. Letting $\eta_i = V^T \xi_i$,
clearly $\eta_i \sim N_p(0, I_p)$ since $V$ is orthogonal. Thus,

\[ \sum_{i=1}^{n} Q_i = \sum_{i=1}^{n} \sum_{j=1}^{p} d_j \eta_{ij}^2. \]

We now use Proposition 5.16 of Vershynin (2010), which provides an exponential
tail inequality for centered sub-exponential random variables. The
proof is completed by noting that $\eta_{ij}^2 - 1$ is centered sub-exponential and
$\|A\|_2 = \max_j |d_j|$, $\|A\|_F^2 = \sum_{j=1}^{p} d_j^2$. □

A standard approach (Ghosal, Ghosh and van der Vaart, 2000) in Bayesian
asymptotic theory to establish a posterior contraction rate (say $\epsilon_n$) is to develop
exponentially consistent test functions for the true density versus the
complement of an $\epsilon_n$-ball (in an appropriate norm) around the truth with
type I and II error rates of the order $\exp(-n \epsilon_n^2)$. This serves as an asymptotic
identifiability criterion where the likelihood can differentiate the true
density from ones that are $\epsilon_n$ apart. The choice of the distance metric plays
a crucial role in dictating the error rates. Accordingly, the next two subsections
are devoted to developing point null versus point alternative test
functions in the Frobenius and operator norms.
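The diagonalization step in the proof of Lemma 5.4 can be illustrated with a short simulation. The sketch below draws standard Gaussian vectors, checks that each quadratic form $\xi_i^T A \xi_i$ equals $\sum_j d_j \eta_{ij}^2$ with $\eta_i = V^T\xi_i$, and reports how close the sample average is to $\mathrm{tr}(A)$. The matrix `A`, the sample size and the seed are arbitrary illustrative choices, and the simulation shows concentration only qualitatively; it does not estimate the constants $C, K$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 8, 20000

# A random symmetric matrix and its eigendecomposition A = V diag(d) V^T.
G = rng.standard_normal((p, p))
A = (G + G.T) / 2
d, V = np.linalg.eigh(A)

xi = rng.standard_normal((n, p))               # rows are xi_i ~ N_p(0, I_p)
Q = np.einsum("ij,jk,ik->i", xi, A, xi)        # Q_i = xi_i^T A xi_i

eta = xi @ V                                   # rows are eta_i = V^T xi_i
Q_diag = (eta ** 2) @ d                        # sum_j d_j * eta_ij^2

deviation = abs(Q.mean() - np.trace(A))
print(f"tr(A) = {np.trace(A):.4f}, sample mean of Q = {Q.mean():.4f}, deviation = {deviation:.4f}")
```

With $\mathrm{Var}(Q_i) = 2\|A\|_F^2$, the deviation of the sample mean is of order $\|A\|_F\sqrt{2/n}$, in line with the sub-Gaussian regime of the bound for small $t$.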

Specifically, based on $n$ i.i.d. samples $y_1, \ldots, y_n$ from $N_{p_n}(0, \Sigma_n)$, consider
testing

\[ H_0 : \Sigma_n = \Sigma_{0n} \quad \text{versus} \quad H_1 : \Sigma_n = \Sigma_{1n},\ \|\Sigma_{0n} - \Sigma_{1n}\| \ge j \epsilon_n \tag{5.1} \]

for some integer $j \ge 1$ with $\|\cdot\|$ the Frobenius or the operator norm. We shall
denote probabilities/expectations under $\Sigma_{0n}$ and $\Sigma_{1n}$ by $\mathbb{E}_0/\mathbb{P}_0$ and $\mathbb{E}_1/\mathbb{P}_1$
respectively.

5.3. Test function construction in Frobenius norm. As mentioned 
earlier, we consider p n < n for the Frobenius norm. We show that, 

Theorem 5.5. Let Eo n ,Si n G Con with p n < n. Let pi n ,2in > 1 &e 
sequences such that 

(5.2) — < s m i n (Ej n ) < s max (E/ n ) < pin 

for I = 0, 1. Let e n = e n /pi n . Then, there exists a constant J > 1 such that, 
for any j > J, there exist a sequence of test functions (f>j^ n for (5.1) with 
INI = W'Wf su °h that, 

(5-3) E Son ^ )n < e -^ 2 ^L 

(5.4) E Sl Jl-^)< e - c ^i« 

for some constant C > andt 2n = l/(pon£on)> *in = *L min (£inMn, 

Proof. Let ^ n = S^E^ 1 - E^E^ 2 , d n = p n || F , Q t = yfp£ - 
s in)^' Qn = n 1 Y!i=i Qu an d Ai = log |E ln Eo^|. Using standard results 
for quadratic forms, 

(5.5) E Q n = tr(I p „ - EonE^ 1 ), EiQ n = tr(Ei n E^ - I Pn ). 

We define our test function in terms of the rejection region as <ftj^ n := 
^[Q n -D n >-a n d 2 ]i where < a n < 1 is to be determined in the sequel. 

We begin by establishing the type I error bound in (5.3). To that end, we 
first show that one can find f3 n G (0, 1) with E Q n - D n < -f3 n d 2 n . Clearly, 

EoOn - D n = log (EonS^I - tr(Eo n E^ - I p J. 
Let $H_n$ denote the symmetric positive definite matrix $\Sigma_{1n}^{-1/2} \Sigma_{0n} \Sigma_{1n}^{-1/2}$ with
eigenvalues $\psi_l > 0$, $l = 1, \ldots, p_n$. By similarity, $\Sigma_{0n} \Sigma_{1n}^{-1}$ has the same eigenvalues
as $H_n$. Also, $A_n = H_n^{-1} - I_{p_n}$. Thus,



(D n - E Q n ) - p n d 2 n = tr(H n - I p ) - log \H n \ - f5 n lltf" 1 - I Pn 



(5.6) 



Pn 

E 

i=i 



logipi - /3 n [ — - 1 



Now, by Lemma 5.1, s min (H n ) = s min (S ln 1 S „) > s min (S lre 1 )s mill (Eo n ) > 
l/(£Q n pi n ). Hence, choosing /3 n = l/(£ 0ri/ oi n ) 2 , one can ensure that the ex- 
pression in (5.6) is non-negative. Choosing a n = f3 n /2, we have 



E, 



0Wj,n, 



(5.7) 



= P (Qn ~D n > -a n d 2 n ) 

= P {Q n - E Q„ > -a n d 2 n - (E Q„ - D n )} 

< P {Qn - E Q„ > Pnd 2 n /2). 



Letting & = E 0n 1/2 yj, it follows that = £?.B n & with B n = I^-Ej^E^Ej/ 2 . 
Clearly, £j ~ N(0, I Pn ) under i?o- Using (5.5) and invoking Lemma 5.4, we 
obtain 



(5.8) 



P (Qn-E Q n > /3„d 2 /2) 
1 n 

-Vr^-tr^; 



< 



< 



cxp 



i=l 

- C min 



> /?X/2 

nf5 n d? n 



K*\\B n \\ 2 F ' K\\B n \\ 2 



Now, using Lemma 5.1 & (Rl) in Lemma 5.2, 

2 
F 



di 



v l/2 v _l v l/2 _ T 
^ln ^On^ln V 



tr^E^-IpJ 2 

1/2 v v -l/2 



(5.9) 



(^0n ^In^On ^Pn) 
^ln ~~ Eon|| ^ 
||Ei n — Eon||p > ||El n — £on||j? 



^ s min(^0n) ll^ln - E() n ||^ 



IV II 2 



POr, 
Along similar lines,

ll-^nlli? ^ s max(^in) 2 ll^ln ~~ ^OnH^ 
fr-in\ ||Sl n — Son||p ^ || ^ln — ^On || P 

(5.10) = < 2 • 

Also, by Lemma 5.1, 

(5.11) \\B n \\2 < 1 + Smax(S7 n 1 )s max (Son) < 2 y O n£:ln- 

Thus, from (5.9) - (5.11), 

fc-io^ /^n^n ^ ll^ln ~ ^OnH^ £ 2 „ 1 

1 ' II R II 2 H 2 ' .4 4 ' 

II -Dn||^ Pin Pin POn^On 

^ ^ II R II — =2 ' ' ~-2 2~~' 

II -"^11 2 Pin Bin Pon£on 

Equation (5.3) clearly follows from substituting the bounds obtained in 

(5.12) - (5.13) in (5.8). 

Now on to the type II error. We have 

M<f>3,n) = Pi (On " ElOn + ElQn " D n < -a n (f n ) 

(5.14) < Px(Q n - EiQn < ~a n d 2 n ), 

where the last inequality uses EiQn — D n > 0, which is immediate from 

— 1/2 

(R2) in Lemma 5.2. Letting = S 1 j/j, proceeding as before and invoking 
Lemma 5.4 once again, we can upper-bound the expression in (5.14) by 



(5.15) 



exp 



C min 



K 2 \\A n \\y K\\A n \\ 2 



exp 



C min 



K 2 ' K\\A n \\ 2 



since cZ ra = H^-nllf- Also, by Lemma 5.1, 

(5-16) ||A n || 2 < 1 + Smax(SQ n L )s ma x(Sln) ^ ^PlnQorf 

Thus, up to constants, 



fr 2 .2 \ ll^ln Son||j7 1 

(5.17) a n d n > Z2 -2 2 » 

Pin P0n@On 
/ K io\ «n^n \ ||Si„ — Son||j7 1 

( } lOuT i 2 ^ 2_ " 

Il^n|l2 Pin POnM-On 

As before, (5.4) follows from substituting the bounds obtained in (5.17) - 

(5.18) into (5.15). □ 
5.4. Test function construction in operator norm. We now construct
a novel test function for large covariance matrices ($p_n \gg n$) that
admit a factor decomposition (1.2) to separate points in $\mathcal{C}_n$ in the operator
norm. We were unable to use the likelihood ratio test to attain the desired
error rates for the operator norm, as it seems difficult to exploit the inbuilt
parsimony. We instead use a projection technique to design our tests.

Theorem 5.6. Let E 0n , Sin e C 0n with S fa = A £n Aj n + aj n I p for i = 
0, 1. Assume that Ao n satisfies (A3) in Assumption 3.1. Let e n = \/log p n /n 
and e n = y (log p n ) 3 /n. Then, there exists a positive integer J > 1, such 
that for any j > J, one can construct a sequence of test functions (j)j >n for 
(5.1) with \\-\\ = ||-|| 2 such that, 

(5-19) ^oMn < e~° nrel , 

(5-20) E Sln (l-^, n ) <e- Cn rt 

for some constant C > 0. Moreover, if e n is changed to v n \J (logp n ) 3 /n for 
some increasing sequence v n , the conclusion of the theorem remains valid 
with e n modified to y/v^ \J 'log p n /n. 

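Proof. Let $x_i = (1/c_n) \Lambda_{0n}^T y_i$ and $z_i = \Lambda_{0n} x_i$ for $i = 1, \ldots, n$, so that
$x_i \in \mathbb{R}^{k_n}$ and $z_i \in \mathbb{R}^{p_n}$. Denote

\[ \hat\Sigma_y = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T, \qquad \hat\Sigma_x = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T, \qquad \hat\Sigma_z = \frac{1}{n} \sum_{i=1}^{n} z_i z_i^T. \]

Clearly, $\hat\Sigma_z = \Lambda_{0n} \hat\Sigma_x \Lambda_{0n}^T$ and $\hat\Sigma_x = (1/c_n^2) \Lambda_{0n}^T \hat\Sigma_y \Lambda_{0n}$. With these definitions,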
letting e n = \J (log p n ) 2 /n, we define our test function to be 



1 {||E z -S n|| 2 >je n /2} 



It is known (see Bickel and Levina (2008a, b); Fan, Fan and Lv (2008);
Johnstone (2001); Muirhead) that the sample covariance matrix $\hat\Sigma_y$ does not have
the desired concentration around the population mean in operator norm
when $p_n > n$. To circumvent this difficulty, we exploit the near low-rank
structure in the truth and replace $\hat\Sigma_y$ by $\hat\Sigma_z$. Our guiding intuition is that
the lower-dimensional $\hat\Sigma_x$ should concentrate appropriately around its mean
(in operator norm) and we hope to carry through the same concentration
to $\hat\Sigma_z$, since $\Lambda_{0n}/\sqrt{c_n}$ behaves like an approximate isometry under (A3) in
Assumption 3.1 (see also Lemma 5.3).
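The projection construction can be sketched in a few lines. The code below simulates data from a factor model, forms $x_i = \Lambda_{0n}^T y_i / c_n$ and $z_i = \Lambda_{0n} x_i$, and checks the two algebraic relations between the sample covariance matrices of $y$, $x$ and $z$. Everything specific here is a hypothetical choice for illustration: the loading matrix is taken to be $\sqrt{c}$ times a matrix with orthonormal columns (an exact version of the approximate isometry in (A3)), and `p`, `k`, `n`, `c` and `sigma2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, n = 50, 3, 200
c = 10.0                                           # hypothetical scaling constant c_n
sigma2 = 0.5                                       # hypothetical residual variance

# Loadings with Lam^T Lam = c * I_k, so Lam / sqrt(c) is an exact isometry.
Q_mat, _ = np.linalg.qr(rng.standard_normal((p, k)))
Lam = np.sqrt(c) * Q_mat

# Rows of Y are draws from N(0, Lam Lam^T + sigma2 * I_p).
Y = rng.standard_normal((n, k)) @ Lam.T + np.sqrt(sigma2) * rng.standard_normal((n, p))

X = Y @ Lam / c                                    # rows: x_i = Lam^T y_i / c  (k-dimensional)
Z = X @ Lam.T                                      # rows: z_i = Lam x_i        (p-dimensional)

S_y = Y.T @ Y / n
S_x = X.T @ X / n
S_z = Z.T @ Z / n

# Operator-norm errors against the population covariance of y.
Sigma0 = Lam @ Lam.T + sigma2 * np.eye(p)
err_y = np.linalg.norm(S_y - Sigma0, 2)
err_z = np.linalg.norm(S_z - Sigma0, 2)
print(f"||S_y - Sigma0||_2 = {err_y:.3f}, ||S_z - Sigma0||_2 = {err_z:.3f}")
```

Only the $k \times k$ matrix `S_x` carries randomness in `S_z`, which is the reason the projected estimator can concentrate in operator norm even when the full sample covariance does not.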
We first show that there exists a positive integer $J^* \ge 1$ such that for
$j \ge J^*$,



(5.21) 

Indeed, we have 



0/i 



3§n 
2 2 



< e -Cnjel 



0/i 



= sup 
< sup 
+ sup 



-y T A 0n S a; A r n t; - v T A Qn Al n v - a^ n 



c A Qn t x A^ n v - v T A 0n A^ n v - ^v T A 0n A l 



On' 



2 T 



1 



-A 0n Aon - !pn 



< sup 

«/eM fe :||«/|| 2 <||Aon|| 2 

(5.22) 



T ^Y)77 T 

u; 1j x w — w w — w —^—Lk n w 

Cn 



_l ~2 



<I|A, 



On || 2 



+ °0n 



Since e n = y (log p n ) 2 /n, by (A3) and (A4), the second term in (5.22) can 
be bounded above by je n /4 by choosing j larger than some constant Ji. 
Thus, 



(5.23) 
Po 



0/i 



2 2 



< 



||A 



On || 2 



C77, 



> 



Now, 



EqSx = -2-Ao n [A „Ao n + <To n I Pn ]A 



0/i 



2 ^ 1 

A 0n^0n ) +— — Aj n Aon- 

Cn C n 
Hence,



'0n n 



(5.24) 



< 



(5.25) 



< 



E S a 



E E X 



+ 



+ 



0", 



On 



h 



+ 



-ALA, 



0n J1 0n — Ifc„ 



To tackle the second term of (5.25), simply note that $\|A - I_{k_n}\|_2 \le \delta$ for $A$
symmetric and some $\delta \in (0,1)$ implies that $\|A^2 - I_{k_n}\|_2 \le 3\delta$. Using this
observation and invoking (A3) and (A4), we obtain



1 



ALAon 



Ifc 



+ 



0)i 



— Ao n A 0n - h n ) 

-n J 



3+ ^0n 



'Pn 



for some global constant C. Since 1 1 Aon 1 1 2 — ^V^n by (A3), the second term 
in (5.24) multiplied by ||Aon|| 2 can be thus made smaller than je n /8 by 
choosing j larger than some J2. Hence, continuing from (5.23), 



I A 



On || 2 



^~ k n ^ k jj 



> 



< 



I A 



On || 2 



> 



3§r. 



By a modification to Theorem 5.39 of Vershynin (2010) (see Remark 5.40), 
we obtain that for every t > 0, 



(5.26) 



> max{<5, 5 2 } 









E t x 




< e 


2 





-at 2 



for 5 = C\Jk n jn + t/y/n and some global constants C',C > 0. Choosing 
i = CVj logpn and using fc n = 0(1), we get the desired bound (5.19) if we 
can show that 



(5.27) ^>C||A 



On || 2 







E t x 


max 


2 



j log p n jhgp n 



n 



n 



Indeed, (5.19) holds for j > 1, since ||Ao„|| 2 < 2 = O (log p n ), 

E S X . = 0(1), \fj < j and e n = \Jlogp n jn £ (0,1) so that e n < e n . 
'he claim in (5.21) will thus follow by choosing J* = max{Ji, J2}. 
We next show that there exists a constant $J^{**}$ such that for $j \ge J^{**}$,



(5.28) 



-*0n 



2 ~ 2 



< e 



-Cnj 2 e r , 



Proceeding as in (5.22), we obtain 









E z — 


> Aon U 


E 




2 





0", 



0/i 



h , 



2 

^On 



-AonAo n - lp T 



As before, the second term in the right hand side of the above equation can 
be bounded above by je n /32 by choosing j larger than some constant J3. 
Thus, 



^0n 



2 ~ 2 



< 



I A. 



On || 2 



(5.29) 



< 



I A 



On || 2 



0n- 



^x — Ifc n ~Ik„ 

Crj. 



E. r -EiE, 



< 



< 



32 
32 



By (A3) and Lemma 5.3, both || Aon/y^l^ anc ^ s min(A : n Aon/cn) can be 



EnE,,, — Ii 



LnZj r — J-i. — K 



bounded below by 3/4. Further, by (5.24), ||Ao n || 2 

can be bounded above by je n /64 by choosing j larger than some constant 
J4. Thus, for j > J** : = max{ J3, J4}, 



|Aon| 



C77, 



> 



> 



I Aon || 2 

3 

35je w 
64 ' 



"Aon(^ln — Eon)A 



On 



Aon 2 




Ex — 


E 


Ex 






Aon 




1 


2 64 






2 





_ JJn 
2 64 



-Ao n A r n (Ei n — Eon) 



iin - 3 



> 



64 " 4^ 



Aon(^ln — Eon)Aon 
Eln Eon||2 ^min 



AonAo n 



where the penultimate inequality uses (ii) in Lemma 5.1 and the fact that 
en/\/^n > e n - Hence, the quantity in (5.29) can be bounded above by 



IIA 



On || 2 



E x — EiE x . 



2 ~ 64 



whose treatment follows in a similar fashion as (5.26). 
When e n = v n ^(log p n ) 3 /n for some increasing sequence v n , change e n
to v n \J (log p n ) 2 / n in the definition of the test function. The rest of the 
proof goes through similarly, with the modification that we now have to 
choose t = Cy/v^y/j logp n in the display following (5.26), leading to l n = 



6. Proof of the main results. We now proceed to prove the results
stated in Section 3. We prove the rate theorem for the operator norm with
shrinkage prior (Theorem 3.6) in detail and sketch the argument for the
point mass mixture prior (Theorem 3.5). For Theorem 3.3 concerning the
Frobenius norm, again only a sketch is provided.

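6.1. Proof of Theorem 3.6. For $\epsilon_n = \sqrt{(\log p_n)^5/n}$ and some constant
$M > 0$, define the set

\[ U_n = \{\Sigma_n : \|\Sigma_n - \Sigma_{0n}\|_2 \le M \epsilon_n\}. \]

The posterior probability assigned to the complement of $U_n$ is given by

\[ \Pi_n(U_n^c \mid y^{(n)}) = \frac{\int_{U_n^c} \prod_{i=1}^{n} \frac{f_{\Sigma_n}(y_i)}{f_{\Sigma_{0n}}(y_i)}\, d\Pi_n(\Sigma_n)}{\int \prod_{i=1}^{n} \frac{f_{\Sigma_n}(y_i)}{f_{\Sigma_{0n}}(y_i)}\, d\Pi_n(\Sigma_n)} = \frac{\mathcal{N}_n}{\mathcal{D}_n}, \tag{6.1} \]

where $f_{\Sigma_n}$ denotes the density of a $p_n$-dimensional $N(0, \Sigma_n)$ distribution. Here $\mathcal{N}_n$ and $\mathcal{D}_n$
denote the numerator and denominator of the fraction in (6.1).

Let $\sigma(y_1, \ldots, y_n)$ denote the $\sigma$-field generated by $y_1, \ldots, y_n$. We first show
that we can lower-bound $\mathcal{D}_n$ on an event $A_n \in \sigma(y_1, \ldots, y_n)$ with large probability
under $f_{\Sigma_{0n}}$ in Lemma 6.1; the proof can be found in the Appendix.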

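Lemma 6.1. Let $\Sigma_{0n}$ satisfy Assumption 3.4. Let $\delta_n$ be a sequence satisfying
$\delta_n/s_{\min}(\Sigma_{0n}) \to 0$ and $n \delta_n^2/s_{\min}(\Sigma_{0n})^2 \to \infty$. Then, there exists
$A_n \in \sigma(y_1, y_2, \ldots, y_n)$ with $\mathbb{P}_{\Sigma_{0n}}(A_n) \to 1$ such that on $A_n$,

\[ \mathcal{D}_n \ge e^{-C n \delta_n^2/s_{\min}(\Sigma_{0n})^2}\, \Pi_n(\Sigma_n : \|\Sigma_n - \Sigma_{0n}\|_F \le \delta_n). \]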
By Lemma 6.1, it is enough to show

\[ \lim_{n \to \infty} E_{\Sigma_{0n}}\big[ \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_2 > M \epsilon_n \mid y^{(n)})\, 1_{A_n} \big] = 0 \]

to prove Theorem 3.6.

By Assumption 3.1, $\Sigma_{0n}$ has a low-rank structure $\Lambda_{0n} \Lambda_{0n}^T + \sigma_{0n}^2 I_{p_n}$ with
the true number of factors $k_{0n}$ assumed to be bounded and known in (A4) of
Assumption 3.4. Also, recall the model (2.1) is fitted with $k_n = k_{0n}$ factors.
Recall the $\mathrm{supp}_\delta$ notation from Section 4. Following the convention in
Section 2, let $\mathrm{supp}_{\delta'_n}(\Lambda_n)$ denote the set $S \subset \{1, \ldots, p_n k_n\}$ corresponding to
the entries in $\Lambda_n$ (vectorized) larger than $\delta'_n$ in magnitude. Then, for some
$H > 0$ and sequences $t_n, \delta'_n$ to be chosen later,



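\begin{align*}
& E_{\Sigma_{0n}}\big[ \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_2 > M \epsilon_n \mid y^{(n)})\, 1_{A_n} \big] \\
&\quad \le E_{\Sigma_{0n}}\big[ \Pi_n(\|\Sigma_n - \Sigma_{0n}\|_2 > M \epsilon_n,\ |\mathrm{supp}_{\delta'_n}(\Lambda_n)| \le H \log p_n,\ \|\Lambda_n\|_2 \le t_n,\ \sigma^2 \le t_n \mid y^{(n)})\, 1_{A_n} \big] \\
&\qquad + E_{\Sigma_{0n}} \Pi_n(|\mathrm{supp}_{\delta'_n}(\Lambda_n)| > H \log p_n \mid y^{(n)}) + E_{\Sigma_{0n}} \Pi_n(\|\Lambda_n\|_2 > t_n \mid y^{(n)}) + E_{\Sigma_{0n}} \Pi_n(\sigma^2 > t_n \mid y^{(n)}). \tag{6.2}
\end{align*}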

Let $t_n = C (\log p_n)^2$, $\delta_n = C \sqrt{\log p_n/n}$ and $\delta'_n = \delta_n/p_n$. With these choices,
we shall first show in Lemma 6.2 and Lemma 6.3 that the posterior probabilities
of the sets $\{|\mathrm{supp}_{\delta'_n}(\Lambda_n)| > H \log p_n\}$, $\{\|\Lambda_n\|_2 > t_n\}$ and $\{\sigma^2 > t_n\}$ go
to zero, so that we can focus on the set $U_n^* = \{\|\Sigma_n - \Sigma_{0n}\|_2 > M \epsilon_n,\ \|\Lambda_n\|_2 \le
t_n,\ \sigma^2 \le t_n\} \cap \{|\mathrm{supp}_{\delta'_n}(\Lambda_n)| \le H \log p_n\}$. This will be crucial in reducing
the entropy of the model space later on. The proofs of both lemmas
are provided in the Appendix.



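Lemma 6.2. Recall $\delta_n = C \sqrt{\log p_n/n}$ and $\delta'_n = \delta_n/p_n$. Then, there exists
a constant $H > 0$ such that

\[ \lim_{n \to \infty} E_{\Sigma_{0n}}\big[ \Pi_n(|\mathrm{supp}_{\delta'_n}(\Lambda_n)| > H \log p_n \mid y^{(n)})\, 1_{A_n} \big] = 0. \tag{6.3} \]

Lemma 6.3. There exists a constant $C > 0$ such that with $t_n = C (\log p_n)^2$,

\[ \lim_{n \to \infty} E_{\Sigma_{0n}}\big[ \Pi_n(\|\Lambda_n\|_2 > t_n \mid y^{(n)})\, 1_{A_n} \big] = 0, \tag{6.4} \]

\[ \lim_{n \to \infty} E_{\Sigma_{0n}}\big[ \Pi_n(\sigma^2 > t_n \mid y^{(n)})\, 1_{A_n} \big] = 0. \tag{6.5} \]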

We now turn to proving Theorem 3.6. Let $S_0 = \mathrm{supp}(\Lambda_{0n})$. For a set
$S \subset \{1, \ldots, p_n k_n\}$ with $|S| \le H \log p_n$, let $B_{j,S,n}$ denote the subset of $\mathcal{C}_n$:

\[ B_{j,S,n} = \{\Sigma_n \in U_n^* : j \epsilon_n \le \|\Sigma_n - \Sigma_{0n}\|_2 \le (j + 1) \epsilon_n,\ \mathrm{supp}_{\delta'_n}(\Lambda_n) = S\}. \]

Then,



E So „P(^ I y (n) )U n < E E E s 0n iP[s n g B jAn | y W] U„ 

S:\S\<Hlogp n j=M 



£ E E 

S:|fif|<Hlogpnj=Af 
(6.6) 

oo 

£ E E 

S:\S\<Hlogp n j=M 



jg^n nr=i /^ n ( £-) dn "( s ") 



Es „^i,S,n + /3i,5,n Slip E E „(1 



where $\Phi_{j,S,n}$ is a test function for

\[ H_0 : \Sigma_n = \Sigma_{0n} \quad \text{versus} \quad H_1 : \Sigma_n \in B_{j,S,n}, \tag{6.7} \]

whose construction is provided below and 

n n (-Bj,s,n) 



(6. 



e -n52/^in(E0n) 2 P (|| En _ £ 0n || F < §n ) ' 

To construct the test function $\Phi_{j,S,n}$, we break up $B_{j,S,n}$ into balls and obtain
local tests for $\Sigma_{0n}$ versus the centers of each of the balls using Theorem 5.6.
Since we have already conditioned on $|\mathrm{supp}_{\delta'_n}(\Lambda_n)| \le H \log p_n$, the number
of such balls can be controlled and $\Phi_{j,S,n}$ is obtained as the maximum of the
local tests.

Let $\Sigma_{n,l}$ for $l \in I_{j,S,n}$ be a $j \epsilon_n/2$-net of $B_{j,S,n}$ in operator norm and for
each $l$, define $E_{j,S,n,l} = \{\Sigma_n \in B_{j,S,n} : \|\Sigma_n - \Sigma_{n,l}\|_2 \le j \epsilon_n/2\}$. By definition,

\[ B_{j,S,n} \subset \bigcup_{l \in I_{j,S,n}} E_{j,S,n,l}. \]

Let <t>j,s,n,i denote the point versus point test developed in Theorem 5.6 
for E n = Son versus S n = E n> / with the sequence v n = logp n , so that 
e n = y (log p n ) 2 /n in (5.19) and (5.20). Clearly, 4>j,s,n,i used as a test for 
Son versus S n G Ej^,n,i retains the same type I and II error rates. Letting 
®j,S,n — rnax/g/. gn <f>j t s,n,l, one clearly has from Theorem 5.6, 



* i.fi.n) < \Ij,S,n\ e 



sup E SB (l-$ jiSin ) <e" c ^. 

To estimate $|I_{j,S,n}|$, i.e., the covering number of $B_{j,S,n}$ in operator norm,
we first embed $B_{j,S,n}$ inside a bigger set $\tilde{B}_{j,S,n}$ in Lemma 6.4. As we shall
see, it is easier to estimate the covering number of $\tilde{B}_{j,S,n}$. For notational
convenience, we use $P_S(\theta)$ below to denote $\theta_S$ defined in Section 2.
Lemma 6.4. Recall the sequence $t_n$ from Lemma 6.3. Then,



B 



w/iere = {A n : supp^ (A n ) = 5, ||A n || F < Ct n }. 

Proof. The proof simply follows from the fact that ||A re || F < s/h^ ||A n || 2 < 
y/k^tn < Ct n since k n = 0(1) by (AO). □ 

We now proceed to explicitly construct a je n /2-net for Bj ; s, n - Let £ n = 
W(4i n ). Let {A;}f =1 be a £ n -net of fij^ n . Also, let {^}^ =1 be a je n /4-net 
of [0, i n ]. We show below that {A[AJ + er 2 }z,r form a je n /2-net of Bj t s, n in 
operator norm. 

Let E = AA T + ct 2 I be in -Bj,s,n- Find A/ and cr 2 from the respective nets 



so that 
E — E 



A, -A 



< £ n and |er 2 - ct 2 | < je n /4. Let E = A^A^ + cr 2 . Then, 



< jen/4: + 



AiAj - AA T 



< j%/4+ [||A,|| 2 + ||A|| 2 K„ < je n /4 + 2t n je n /(4t n ) = je n /2. 



We have thus proved our claim and hence the je n /2-covering number of 
Bj,s,n is bounded by L x R. Note that the control on ||A n || 2 in Bj : s, n is 
crucially used in the above display. 
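The net argument rests on the perturbation bound $\|\Lambda_l \Lambda_l^T - \Lambda\Lambda^T\|_2 \le (\|\Lambda_l\|_2 + \|\Lambda\|_2)\,\|\Lambda_l - \Lambda\|_2$, which follows from writing $\Lambda_l \Lambda_l^T - \Lambda\Lambda^T = \Lambda_l(\Lambda_l - \Lambda)^T + (\Lambda_l - \Lambda)\Lambda^T$. The sketch below checks the resulting bound on the covariance matrices numerically; the matrices, the perturbation size and the two variance values are arbitrary illustrative choices, and the operator-norm distance between loadings is used, which is dominated by the Frobenius distance.

```python
import numpy as np

rng = np.random.default_rng(5)
p, k = 30, 3

Lam = rng.standard_normal((p, k))                      # a loading matrix in the set
Lam_l = Lam + 1e-2 * rng.standard_normal((p, k))       # a nearby net point (illustrative)
sigma2, sigma2_r = 1.0, 1.003                          # residual variance and its net point

def op_norm(M):
    """Operator (spectral) norm."""
    return np.linalg.norm(M, 2)

Sigma = Lam @ Lam.T + sigma2 * np.eye(p)
Sigma_net = Lam_l @ Lam_l.T + sigma2_r * np.eye(p)

lhs = op_norm(Sigma_net - Sigma)
rhs = abs(sigma2_r - sigma2) + (op_norm(Lam_l) + op_norm(Lam)) * op_norm(Lam_l - Lam)
print(f"||Sigma_net - Sigma||_2 = {lhs:.5f} <= bound = {rhs:.5f}")
```

The bound shows why a uniform cap on $\|\Lambda_n\|_2$ is needed: the resolution required of the net for the loadings scales inversely with that cap.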

Clearly R < Ct n /(je n ). With s = \S\, let {6>/}f =1 be a £„/2-net of the 
Euclidean sphere in ~R S of radius Ct n . By Lemma 5.2 of Vershynin (2010), 
the cardinality of such a net L < (1 -+- Ct n /£ n ) s . We now exhibit a £ n -net 
{A^}^ =1 to Bjg n in Frobenius norm (or equivalently the Euclidean norm 

after vectorizing) as follows. Set Ps{Ai) = 0i and Psc(Aj) = 0. Let A G #^j n 
and = Ps(A). There exists 9[ such that \\6 — 6i\\ 2 < £ n /2. Also, since 
supp^/ (A) = S, ||Psc(A)|| 2 < 5 n . By choosing j larger than some constant 
J, we can make £ n > 25 n . Hence ||A/ — A\\ F < £ n . 
Thus, finally 



(6.9) 
(6.10) 



Eso„(^,5,n)<e Cslos ™e- Cl ^, 
sup EsJl-Sj-s.n) <e~ c ^. 



We next proceed to upper-bound Pj,s,n from (6.8). To that end, we first 
lower-bound P(||E n — Eo n ||^ < 5 n ) in the following Lemma 6.5. 
Lemma 6.5. If $\Sigma_{0n}$ satisfies Assumption 3.4, the prior on $\Sigma_n$ is as in
Theorem 3.6 and $\delta_n = \sqrt{\log p_n/n}$, then

n\\^n-^ n \\ F <5 n )>e- cl °^. 

Proof. It follows that 

\\A n - A 0n \\ F < 5 n /(4^/c^),\al - a% n \ < 5 n /(2^/p^) - T, 0n \\ F < 5, 



n - 



since, invoking (A3) and by Lemma 5.1, 

\\A n Al - A 0n AU F < {2 ||Aon|| 2 + ^=^^=< S n /2. 

Now, F{\al - al n \ < 5 n /(2^)} > exp(-a 2 Qn )[l - exp{-<W(2^)}]. Us- 
ing 1 — e~ x > x/2 for x € (0, 1), this term can be bounded below by e~ c ^ ogPn . 

By (A3), ||Ao„|| 2 < implying ||A „|| F < 2^/c n k n . Letting vec(A „) 

denote Ao n vectorized, it follows by (A2) and the Cauchy-Schwartz inequal- 
ity that ||vec(Aon)||i < Clogp n - Hence we can invoke Lemma 4.1 to conclude 
that P{||A n - A 0n || F < <V(V^)i > e- Clogp ". 

□ 

Using Lemma 6.5 and (A4), Pj,s,n < e - cl °gp™. Substituting the error 
bounds obtained in (6.9) & (6.10) in (6.6), we can bound the expression in 
(6.6) by 

H\0gPn / v r OO 

(6.11) £ ( Pn ) ^ e ^i° g « e -Cu'(iog P „) 2 +/ 3. j5ne -c 2 i(iog P „)= 
s =o V s / L 



j=M 



Noting that $\max_{s \le H \log p_n} \binom{p_n}{s} \le \exp\{C (\log p_n)^2\}$ and substituting the
upper bound to $\beta_{j,S,n}$ obtained above in (6.11), (6.11) goes to $0$ for large
enough $M > 0$. This finishes the proof of Theorem 3.6. □
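The combinatorial bound used in the last step follows from $\binom{p}{s} \le p^s$ and $s \le H \log p$. The sketch below checks it on a logarithmic scale with the standard library; the value of `H` and the grid of dimensions are illustrative choices, and the constant $C$ is taken equal to `H` here.

```python
import math

def max_log_binomial(p, H):
    """max over 0 <= s <= H*log(p) of log C(p, s)."""
    s_max = int(H * math.log(p))
    return max(math.log(math.comb(p, s)) for s in range(s_max + 1))

H = 2.0
results = {}
for p in (100, 1_000, 10_000):
    log_count = max_log_binomial(p, H)
    log_bound = H * math.log(p) ** 2          # log of exp{H (log p)^2}
    results[p] = (log_count, log_bound)
    print(f"p = {p:>6}: max log C(p, s) = {log_count:8.2f} <= {log_bound:8.2f}")
```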

6.2. Proof of Theorem 3.5. The proof is very similar to the proof of 
the previous theorem, hence we only sketch a brief outline. Since the point 
mass mixture priors allow exact zeros in the loadings, we can condition on 
supp(A n ) = S here. By properties of point mass mixture priors shown in 
Castillo and van der Vaart (2012), analogues of Lemmata 6.2, 6.3 and 6.5 
can be obtained to conclude the theorem. 
6.3. Proof of Theorem 3.3. Without loss of generality assume $k_n = 1$.
Let £i„ = 0(1/ log n) and £ 2 n = 0(p n ), so that 

n^l < im) = no-- 2 > erJ) = ofor a u iar ge n, 

) > 6n) < exp{-^2n}- 

The second tail probability above follows from known large deviation re- 
sults for the largest eigenvalue of i.i.d. N(0, 1) p n x k n random matrices; see 
Corollary 5.35 of Vershynin (2010). Proceeding as in the proofs of Lemma 
6.3 and Lemma 6.2 with S n = \Jp n /n we conclude that 

) > 6n I y n ) o. 

Defining B^ n = {S n : je n < ||E 0n - S n || F < (j + l)e n } and denoting D n 
as in Lemma 6.1, 

E Son P{||S„ - S n|| F > Me n ,a 2 n > £ ) < 6n I y (re) K4 n 



oo 



(6.12) < £ 
where 



Es ^i,n + /3j,n sup E Sn (l-$ J> ) 



(6.13) - 



e-™^n n (||S n -S „|| F < 5 n ) 
and $j jn is a test function for 

(6.14) H : E n = S 0n versus Hi : £„ G 

Observe that the test function in Theorem 5.5 can also be used for testing 

(6.15) H : S n = S „ versus Hi : S in G E n 

with £7 n = {S n : ||E n — Si n || F < ||£i n — So n ||^/2}. Taking the maximum 
of the test functions for Eon versus each of the balls of type E n of radius je n /2 
covering Bj^ n , we obtain Qj )n for testing T,Q n versus H^ n . From Theorem 5.5 
it follows that 

EEo»%n < N(je n /2,B j>n ,\\ ■ \\ F )e- Cn ^ % 

where 

2 _ 1 1 2 gin 1 

92n -2-2 2 ^2 9 n \9 ' -4-4 4 /-2 <"4 „4 /l \4 ' 
Also, from Lemma 5.2 of Vershynin (2010),

(6.16) N(e n ,B j>n , \\-\\ F )< {l + 4(j + l)/j} p " < 9 P ". 

Defining u p as the volume of the p-dimensional Euclidean ball, it follows 
from Lemma 5.2 in Castillo and van der Vaart (2012) that 

H(flj,„) < is Pn {(j + l)en) Pn } max j [J ^(Aj) : ||A n - A 0n || < (j + l)e n 1 

3 

< exp | y log C - ^ log p„ - p„ + p„ log{ (j + l)e n } 

From Lemma A. 2 with Km = 0(^/p^), 

(6.17) /3 iin < n( J B J> )e n<5 "exp{-Cp n + p n +p n log(5 n /2Cp n )} 

for some constant C > 0. From (6.16), it follows that (6.12) is bounded 
above by 



18) \e Pn Xo ^ e -j 2 nql n el + fo^-pn&A \ 

■i—A/r \ J 



j=M 

which converges to for large enough M if e n = \l ^ log 3 n. 
APPENDIX

Lemma A.1. Let $a = \log p$ and $f_\tau$ denote the IG$(a, a)$ distribution. If
$\tau \sim f_\tau$, then for large $p$,

\[ P(\tau > \log p) \le e^{-C \log p}, \]

\[ P(\tau \in [2 \log p, 4 \log p]) \ge e^{-C \log p}, \]

\[ P(\tau < 1/\log p) \le e^{-C \log p}, \]

where $C > 0$ denotes a (different) constant in each display.
Proof. Let $a = \log p$ and assume $\tau \sim$ IG$(a, a)$. Clearly $X = 1/\tau \sim$
Gamma$(a, a)$ with density $f(x) = c\, e^{-ax} x^{a-1}$ on $(0, \infty)$, where $c = a^a/\Gamma(a)$.

We shall use the following result for the gamma function from Kruijer, Rousseau and van der Vaart
(2010): for any $a > 0$,

\[ \Gamma(a) = \sqrt{2\pi}\, e^{-a} a^{a - 1/2} e^{\theta(a)}, \]

where $0 < \theta(a) < 1/(12a)$. Clearly for large $a$,

\[ C_1 e^{-a} a^{a - 1/2} \le \Gamma(a) \le C_2 e^{-a} a^{a - 1/2}. \]

First,



P(X < 1/a) = c / e~ a *i 
■/ o 

l l a 1 a a 1 1 /a 

a 1 (a) a a ya \ e 
< e _c,al °g a . 



r > a) = F(X < 1/a) = c I e^H^dt 



o 



The analysis of P(r G [2a, 4a]) follows similarly to the previous display, 
only the direction of the inequality needs reversal. To that end, lower-bound 
e on (0, 1/a) by e" 1 and use the upper bound for T(a). 

Finally, consider P(r < 1/a). Note that for i > a, < e~ at / 2 . Hence, 



|»00 

P(t < 1/a) = P(X > a) = c / e^ at t a ~ x dt 

J a 

/•oo 

< c / e~ a * /2 dt = (2/a)ce" a2/2 



The result follows since c < Cyfae a . □ 
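The two analytic facts this proof leans on can be checked numerically. The sketch below is only a sanity check, not part of the argument: it evaluates the Stirling remainder $\theta(a)$ through `math.lgamma` and tests the pointwise bound $e^{-at}t^{a-1} \le e^{-at/2}$ on the log scale. The function names and the grid of test values are illustrative assumptions.

```python
import math


def stirling_theta(a: float) -> float:
    """Return theta(a) = log Gamma(a) - log(sqrt(2*pi) * e^{-a} * a^{a - 1/2})."""
    log_stirling = 0.5 * math.log(2.0 * math.pi) - a + (a - 0.5) * math.log(a)
    return math.lgamma(a) - log_stirling


def tail_integrand_dominated(a: float, t: float) -> bool:
    """Check e^{-a t} t^{a-1} <= e^{-a t / 2} on the log scale (requires t > 0)."""
    return -a * t + (a - 1.0) * math.log(t) <= -a * t / 2.0


if __name__ == "__main__":
    # Print theta(a) next to its claimed upper bound 1/(12a).
    for a in (1.0, 2.0, 5.0, 10.0, 50.0):
        print(a, stirling_theta(a), 1.0 / (12.0 * a))
```

Working on the log scale avoids overflow in $t^{a-1}$ for large $a$; the second check reduces to $(a-1)\log t \le at/2$, which holds for all $t > 0$ when $a \ge 1$.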

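Lemma A.2. If $\|\Lambda_{0n}\|_2 \in [\kappa_{1n}, \kappa_{2n}]$, $\kappa_{2n} > 1$ and $\lambda_j \sim N(0, 1)$, $j = 1, \ldots, p_n$,
and $\sigma^2 \sim \mathrm{IG}_{[0,M]}(a, b)$ and $\sigma_{0n}^2 \in [0, M]$, then for $0 < \epsilon < 1$,

$P(\|\Sigma_n - \Sigma_{0n}\|_F < \epsilon) \ge \exp\{-\kappa_{2n}^2 + p_n + p_n \log(\epsilon/2\kappa_{2n})\}.$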

Proof. Since $\sigma_{0n}^2 \in [0, M]$ and $\sigma^2 \sim \mathrm{IG}_{[0,M]}(a, b)$, it is enough to assume
that $\sigma^2 = \sigma_{0n}^2$ to derive the prior concentration. Since $\Sigma_n = \Lambda_n \Lambda_n' + \sigma_{0n}^2 I_{p_n}$,
we will first express the concentration of $\Sigma_n$ around $\Sigma_{0n}$ in terms of the
concentration of $\Lambda_n$ around $\Lambda_{0n}$. Observe that

$\|\Sigma_n - \Sigma_{0n}\|_F \le \|\Lambda_n \Lambda_n' - \Lambda_{0n} \Lambda_{0n}'\|_F$

$\le \|(\Lambda_n - \Lambda_{0n}) \Lambda_n'\|_F + \|\Lambda_{0n} (\Lambda_n - \Lambda_{0n})'\|_F.$

Using the inequality $\|AB\|_F \le \|A\|_2 \|B\|_F$, we have

$\|(\Lambda_n - \Lambda_{0n}) \Lambda_n'\|_F + \|\Lambda_{0n} (\Lambda_n - \Lambda_{0n})'\|_F \le \|\Lambda_n - \Lambda_{0n}\|_F \,(\|\Lambda_{0n}\|_2 + \|\Lambda_n - \Lambda_{0n}\|_F)$

$\le 2\kappa_{2n} \|\Lambda_n - \Lambda_{0n}\|_F.$
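For a vector-valued $\Lambda_n$, the first two steps of this chain reduce to $\|\lambda\lambda' - \lambda_0\lambda_0'\|_F \le \|\lambda - \lambda_0\|_2 (\|\lambda\|_2 + \|\lambda_0\|_2)$, because $\|uv'\|_F = \|u\|_2 \|v\|_2$. The following sketch checks that inequality on random vectors; the helper names, the dimension and the perturbation scale are assumptions made for illustration only.

```python
import math
import random


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def frob_rank_one_diff(lam, lam0):
    """Frobenius norm of lam lam' - lam0 lam0' for two vectors of equal length."""
    total = 0.0
    for i in range(len(lam)):
        for j in range(len(lam)):
            d = lam[i] * lam[j] - lam0[i] * lam0[j]
            total += d * d
    return math.sqrt(total)


def rank_one_bound(lam, lam0):
    """Upper bound ||lam - lam0|| * (||lam|| + ||lam0||) from the triangle inequality."""
    diff = l2([a - b for a, b in zip(lam, lam0)])
    return diff * (l2(lam) + l2(lam0))


if __name__ == "__main__":
    rng = random.Random(0)
    lam0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
    lam = [x + 0.1 * rng.gauss(0.0, 1.0) for x in lam0]
    print(frob_rank_one_diff(lam, lam0), rank_one_bound(lam, lam0))
```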

We estimate a lower bound for $P\{\|\Lambda_n - \Lambda_{0n}\|_F < \epsilon/(2\kappa_{2n})\}$ below. Clearly
$\Lambda_n \sim N_{p_n}(0, I_{p_n})$. Observe that the RKHS of the $p_n$-dimensional Gaussian random
vector $\Lambda_n$ is the range $\{x : x \in \mathbb{R}^{p_n}\}$ equipped with the inner product
$\langle x, y \rangle = x'y$. Since $P\{\|\Lambda_n - \Lambda_{0n}\|_F < \epsilon/(2\kappa_{2n})\} = P\{\|\Lambda_n - \Lambda_{0n}\|_2 < \epsilon/(2\kappa_{2n})\}$,
we have by Borel's inequality for the $p_n$-dimensional Gaussian
random vector $\Lambda_n$

$P\{\|\Lambda_n - \Lambda_{0n}\|_F < \epsilon/(2\kappa_{2n})\} \ge \exp(-\|\Lambda_{0n}\|_F^2)\, P\{\|\Lambda_n\|_2 < \epsilon/(2\kappa_{2n})\}.$

Using the fact that $-\log(2\Phi(x) - 1) \le 1 + |\log x|$ for $0 < x < 1/2$, we
have

$P\{\|\Lambda_n\|_2 < \epsilon/(2\kappa_{2n})\} \ge \exp\{-p_n + p_n \log(\epsilon/2\kappa_{2n})\}.$

The proof follows immediately. □
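The one-dimensional small-ball fact used in the last step, $-\log(2\Phi(x) - 1) \le 1 + |\log x|$ on $(0, 1/2)$, is easy to confirm numerically, since $2\Phi(x) - 1 = \mathrm{erf}(x/\sqrt{2})$. The sketch below computes the slack in that inequality; the function name and the grid are illustrative assumptions.

```python
import math


def small_ball_gap(x: float) -> float:
    """Return (1 + |log x|) - (-log(2*Phi(x) - 1)); positive means the bound holds."""
    central_prob = math.erf(x / math.sqrt(2.0))  # equals 2*Phi(x) - 1 = P(|Z| < x)
    return (1.0 + abs(math.log(x))) + math.log(central_prob)


if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 0.25, 0.499):
        print(x, small_ball_gap(x))
```

The gap stays bounded away from zero on the whole interval because $2\Phi(x) - 1 \approx x\sqrt{2/\pi}$ for small $x$, so the left side is roughly $|\log x| + 0.23$.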

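Proof of Lemma 4.2. We begin with the observation

$P(\|\eta - \eta_0\|_2 < \delta) \ge \prod_{j=1}^{s} P(|\eta_j - \eta_{0j}| < \delta/\sqrt{s}).$

Now if $|\eta_{0j}| > \delta/\sqrt{s}$, then

$P(|\eta_j - \eta_{0j}| < \delta/\sqrt{s}) = \frac{1}{2} e^{-|\eta_{0j}|/\psi} \{ e^{\delta/(\psi\sqrt{s})} - e^{-\delta/(\psi\sqrt{s})} \}$

(A.1) $\ge \frac{1}{2} e^{-|\eta_{0j}|/\psi} \{ 1 - e^{-\delta/(\psi\sqrt{s})} \}.$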

On the other hand, if $|\eta_{0j}| < \delta/\sqrt{s}$, then observe that the interval
$(\eta_{0j} - \delta/\sqrt{s},\, \eta_{0j} + \delta/\sqrt{s})$ contains either $(0, \delta/\sqrt{s})$ or $(-\delta/\sqrt{s}, 0)$ depending on the
sign (positive or negative respectively) of $\eta_{0j}$. Since a DE($\psi$) density is
symmetric about the origin, each of the two intervals has the same probability,
implying

(A.2) $P(|\eta_j - \eta_{0j}| < \delta/\sqrt{s}) \ge \frac{1}{2} \{ 1 - e^{-\delta/(\psi\sqrt{s})} \}.$

The desired inequality in Lemma 4.2 follows from combining (A.1) and (A.2).
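Both cases of this argument can be checked against the exact double-exponential distribution function. The sketch below assumes the DE law is parameterised by a scale $\psi$ with density $e^{-|x|/\psi}/(2\psi)$ (the parameterisation is an assumption here, as are the function names) and compares the exact interval probability with the common lower bound $\frac{1}{2} e^{-|\eta_{0j}|/\psi}\{1 - e^{-r/\psi}\}$, where $r$ plays the role of $\delta/\sqrt{s}$.

```python
import math


def de_cdf(x: float, psi: float) -> float:
    """CDF of the double-exponential law with density exp(-|x|/psi) / (2*psi)."""
    if x < 0.0:
        return 0.5 * math.exp(x / psi)
    return 1.0 - 0.5 * math.exp(-x / psi)


def interval_prob(eta0: float, r: float, psi: float) -> float:
    """Exact P(|eta - eta0| < r) for eta ~ DE(psi)."""
    return de_cdf(eta0 + r, psi) - de_cdf(eta0 - r, psi)


def lower_bound(eta0: float, r: float, psi: float) -> float:
    """Lower bound (1/2) exp(-|eta0|/psi) (1 - exp(-r/psi)) covering both cases."""
    return 0.5 * math.exp(-abs(eta0) / psi) * (1.0 - math.exp(-r / psi))


if __name__ == "__main__":
    for eta0 in (-2.0, -0.05, 0.0, 0.05, 2.0):
        print(eta0, interval_prob(eta0, 0.1, 1.0), lower_bound(eta0, 0.1, 1.0))
```

When $|\eta_{0j}| \le r$ the bound is loose by the factor $e^{-|\eta_{0j}|/\psi}$, which is harmless for the product over coordinates.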
Proof of Lemma 4.4. Using $\Gamma(x + 1) = x\Gamma(x)$, note that $g(x) =
-\log\{\Gamma(x + 1)\}/x$. Hence, $\lim_{x \to 0} g(x) = \frac{d}{dx}[-\log\{\Gamma(x + 1)\}]_{x=0} = \gamma_0$. We
shall now show that $h(x) = -g(x)$ is increasing on $(0, 1/2)$. To that end,
$h'(x) = \{x\Psi(x + 1) - \log \Gamma(x + 1)\}/x^2$, where $\Psi(x) = d/dx[\log \Gamma(x)]$ is
the digamma function. Let $h_1(x) = x^2 h'(x)$. Clearly, $h_1(0) = 0$. Further,
$h_1'(x) = x\Psi'(x + 1) > 0$ since $\log \Gamma$ is convex. Thus $h_1(x) > 0$,
implying $h'(x) > 0$ on $(0, 1/2)$, which concludes the proof.
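The limit and the monotonicity claimed here can be checked with `math.lgamma`. In the sketch below, `g` follows the formula in the proof, `EULER_GAMMA` is the Euler-Mascheroni constant (which is what $\gamma_0$ is taken to denote here, an assumption), and the grid is illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, equal to -digamma(1)


def g(x: float) -> float:
    """g(x) = -log Gamma(x + 1) / x for x > 0."""
    return -math.lgamma(x + 1.0) / x


def h(x: float) -> float:
    """h(x) = -g(x); the proof shows h is increasing on (0, 1/2)."""
    return -g(x)


if __name__ == "__main__":
    print("g near 0:", g(1e-6), "Euler constant:", EULER_GAMMA)
    print([round(h(k / 10.0), 6) for k in range(1, 6)])
```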

Proof of Lemma 6.1. First observe that to prove Lemma 6.1, it is
enough to show that $D_n \ge e^{-n\delta_n^2}$ for a probability measure $\Pi_n$ on
$\{\Sigma_n : \|\Sigma_n - \Sigma_{0n}\|_F < \delta_n\}$.

By Jensen's inequality, 



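$\log D_n \ge \int \Big[ \frac{n}{2} \log |\Sigma_{0n}\Sigma_n^{-1}| - \frac{1}{2} \sum_{i=1}^{n} y_i'(\Sigma_n^{-1} - \Sigma_{0n}^{-1}) y_i \Big] \Pi_n(d\Sigma_n).$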



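Letting $Q_i = y_i'(\Sigma_n^{-1} - \Sigma_{0n}^{-1}) y_i$, one clearly has $E_{\Sigma_{0n}} Q_i = \mathrm{tr}(\Sigma_{0n}\Sigma_n^{-1} - I_{p_n})$.
Let $W_i = Q_i - \mathrm{tr}(\Sigma_{0n}\Sigma_n^{-1} - I_{p_n})$ and $S_n = \sum_{i=1}^{n} W_i$. We first show: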

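Lemma A.3. If $\|\Sigma_n - \Sigma_{0n}\|_F < \delta_n$ with $\delta_n/s_{\min}(\Sigma_{0n}) \to 0$ as $n \to \infty$,
then for sufficiently large $n$,

$\log |\Sigma_{0n}\Sigma_n^{-1}| - \mathrm{tr}(\Sigma_{0n}\Sigma_n^{-1} - I_{p_n}) \ge -C \log(\rho_n)\, \frac{\delta_n^2}{s_{\min}(\Sigma_{0n})^2}$

for some absolute constant $C > 0$ and $\rho_n = s_{\max}(\Sigma_{0n})/s_{\min}(\Sigma_{0n})$.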

Proof. Let $H_n$ denote the symmetric positive definite matrix $\Sigma_n^{-1/2} \Sigma_{0n} \Sigma_n^{-1/2}$
with eigenvalues $\psi_j > 0$, $j = 1, \ldots, p_n$. Clearly,

$\log |\Sigma_{0n}\Sigma_n^{-1}| - \mathrm{tr}(\Sigma_{0n}\Sigma_n^{-1} - I_{p_n})$

(A.3) $= \log |H_n| - \mathrm{tr}(H_n - I_{p_n}) = \sum_{j=1}^{p_n} [\log \psi_j - (\psi_j - 1)].$

Consider the function $h_\beta(x) = \log x - (x - 1) + \beta(x - 1)^2/2$ on $(0, \infty)$ for $\beta > 1$.
Clearly, $h_\beta(1) = 0$, $h_\beta'(x) = (x - 1)(\beta - 1/x)$ and $h_\beta''(x) = \beta - 1/x^2$. Thus
$h_\beta$ has a local minimum at $x = 1$ and is monotonically increasing on $(1, \infty)$.
Moreover, $h_\beta$ has a local maximum at $1/\beta$ and the function is monotonically
increasing on $(0, 1/\beta)$ and monotonically decreasing on $(1/\beta, 1)$. Since $h_\beta(1) =
0$, this implies $h_\beta(1/\beta) > 0$ and $h_\beta$ has the property that if $h_\beta(x^*) > 0$, then
$h_\beta(x) \ge 0$ for all $x \ge x^*$. Now suppose $\epsilon \in (0, 1/2)$. We shall show that
$h_\beta(x) \ge 0$ for all $x \ge \epsilon$ if $\beta = 8\log(1/\epsilon)$. Based on the above discussion, it
suffices to show that $h_\beta(\epsilon) > 0$, which follows since $h_\beta(\epsilon) = \log(1/\epsilon)[4(1 -
\epsilon)^2 - 1] + (1 - \epsilon) > 0$.
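The behaviour of $h_\beta$ described in this paragraph can be checked directly. The sketch below evaluates $h_\beta$ with $\beta = 8\log(1/\epsilon)$ on a logarithmic grid over $[\epsilon, 10]$ and compares $h_\beta(\epsilon)$ with the closed form quoted in the text; the grid, its upper end and the function names are illustrative assumptions.

```python
import math


def h_beta(x: float, beta: float) -> float:
    """h_beta(x) = log x - (x - 1) + beta * (x - 1)^2 / 2."""
    return math.log(x) - (x - 1.0) + beta * (x - 1.0) ** 2 / 2.0


def beta_for(eps: float) -> float:
    """The choice beta = 8 log(1/eps) used in the proof, for eps in (0, 1/2)."""
    return 8.0 * math.log(1.0 / eps)


def min_on_grid(eps: float, upper: float = 10.0, steps: int = 400) -> float:
    """Minimum of h_beta over a log-spaced grid on [eps, upper]."""
    beta = beta_for(eps)
    ratio = upper / eps
    return min(h_beta(eps * ratio ** (k / steps), beta) for k in range(steps + 1))


if __name__ == "__main__":
    for eps in (0.4, 0.1, 0.01, 1e-4):
        print(eps, beta_for(eps), min_on_grid(eps))
```

The minimum over the grid is attained near $x = 1$, where $h_\beta$ vanishes, which matches the role of the quadratic term as a lower envelope for $\log x - (x - 1)$.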

Using (iii) in Lemma 5.1, $s_{\min}(H_n) \ge s_{\min}(\Sigma_{0n})/\|\Sigma_n\|_2$. Further, since
$\|\Sigma_n - \Sigma_{0n}\|_2 \le \delta_n$, $\|\Sigma_n\|_2 \le \|\Sigma_{0n}\|_2 + \delta_n < 2\|\Sigma_{0n}\|_2$ since $\delta_n \in (0, 1)$.
Thus, choosing $\epsilon_n = s_{\min}(\Sigma_{0n})/\{2 s_{\max}(\Sigma_{0n})\}$, $\psi_j \ge \epsilon_n$ for all $j = 1, \ldots, p_n$
and the analysis in the preceding paragraph shows $\log \psi_j - (\psi_j - 1) \ge
-C\log(1/\epsilon_n)(\psi_j - 1)^2$. Using (R1) in Lemma 5.2, we thus obtain that the
quantity in (A.3) is bounded below by $-C\log(1/\epsilon_n) \|H_n - I_{p_n}\|_F^2$.

Further, by (i) in Lemma 5.1, $\|H_n - I_{p_n}\|_F^2 \le \|\Sigma_n - \Sigma_{0n}\|_F^2 / \{s_{\min}(\Sigma_n)^2\}$.
We next proceed to lower-bound $s_{\min}(\Sigma_n)$. Using $\|AB\|_2 \ge s_{\min}(A) \|B\|_2$
from (ii) in Lemma 5.1, $\|\Sigma_n - \Sigma_{0n}\|_2 \ge s_{\min}(\Sigma_{0n}) \|\Sigma_{0n}^{-1}\Sigma_n - I_{p_n}\|_2$, implying
$\|\Sigma_{0n}^{-1}\Sigma_n - I_{p_n}\|_2 \le \delta_n/s_{\min}(\Sigma_{0n})$. Since $\delta_n/s_{\min}(\Sigma_{0n}) < 1$ by assumption,
and the singular values of $\Sigma_{0n}^{-1}\Sigma_n - I_{p_n}$ are $|s_j(\Sigma_{0n}^{-1}\Sigma_n) - 1|$ by similarity, it follows
that $s_{\min}(\Sigma_{0n}^{-1}\Sigma_n) \ge 1 - \delta_n/s_{\min}(\Sigma_{0n})$. Invoking (iii) of Lemma 5.1, we finally
get $s_{\min}(\Sigma_n) \ge s_{\min}(\Sigma_{0n})\{1 - \delta_n/s_{\min}(\Sigma_{0n})\} \ge s_{\min}(\Sigma_{0n})/2$ for $n$ sufficiently
large.

□ 

Using Lemma A. 3, one has 



/T 1 n 

^ log l^OnK'l ~ ^OnK 1 ~ lp) - - Y,Wi 

i=l 



1 

> - 

~ 2 



n„(cffi n ) 



Cn5 n ^fjf" no ~ / S n Il n (dE r 



3 nun (Son, 



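Set $A_n = \{y^{(n)} : |S_n| \le C\sqrt{n \log n}\, \delta_n/s_{\min}(\Sigma_{0n})\}$. Since $n\delta_n^2/s_{\min}(\Sigma_{0n})^2 \to
\infty$, $n\delta_n^2/s_{\min}(\Sigma_{0n})^2 > \sqrt{n}\,\delta_n/s_{\min}(\Sigma_{0n})$ for $n$ large. Hence, on $A_n$, $D_n \ge
e^{-Cn\delta_n^2/s_{\min}(\Sigma_{0n})^2}$. It thus remains to show that $P_{\Sigma_{0n}}(A_n^c) \to 0$. To that end, let
$\zeta_{0n}^2 = E_{\Sigma_{0n}}(W_i^2)$. By a standard result on quadratic forms, $\zeta_{0n}^2 = \|\Sigma_{0n}\Sigma_n^{-1} - I_{p_n}\|_F^2 \le
C\delta_n^2/s_{\min}(\Sigma_{0n})^2$ by previous calculations. The proof is completed by an
application of Markov's inequality:

$P_{\Sigma_{0n}}(A_n^c) = P_{\Sigma_{0n}}\{S_n^2 > C\log(n)\, n\delta_n^2/s_{\min}(\Sigma_{0n})^2\} \le$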
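Proof of Lemma 6.2. With $B_n = \{|\mathrm{supp}_{\delta}(\Lambda_n)| > H \log p_n\}$ and using
Lemma 6.1,

$E_{\Sigma_{0n}}[\Pi_n(B_n \mid y^{(n)}) 1_{A_n}] = E_{\Sigma_{0n}}\Big[ \frac{\int_{B_n} \prod_{i=1}^{n} \frac{f_{\Sigma_n}}{f_{\Sigma_{0n}}}(y_i)\, \Pi_n(d\Sigma_n)}{\int \prod_{i=1}^{n} \frac{f_{\Sigma_n}}{f_{\Sigma_{0n}}}(y_i)\, \Pi_n(d\Sigma_n)}\, 1_{A_n} \Big] \le \frac{\Pi_n(B_n)}{e^{-Cn\delta_n^2/s_{\min}(\Sigma_{0n})^2}\, P(\|\Sigma_n - \Sigma_{0n}\|_F < \delta_n)}.$

Using Lemma 4.3, $\Pi_n(B_n) \le e^{-C \log p_n}$. By (A4), $s_{\min}(\Sigma_{0n})$ is bounded
below by a constant and $P(\|\Sigma_n - \Sigma_{0n}\|_F < \delta_n) \ge e^{-C \log p_n}$ by Lemma 6.5.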

Proof of Lemma 6.3. The proof for the second part follows along the
same lines as the proof of Lemma 6.2, observing that for large $t_n$,

h a f 00 

P{cj 2 >t n ) < / e~ bx x a ~ l dx 

r ( a ) Jt n 



b" 



oo 



r(a) J tn 

< Ce~ c ' tn . 

For the first part, note that $\|\Lambda_n\|_2 \le C \|\mathrm{vec}(\Lambda_n)\|_1$ and $P(\|\mathrm{vec}(\Lambda_n)\|_1 >
t_n) \le e^{-C \log p_n}$ by Lemma 4.5.